Article

Construction of Disease-Symptom Knowledge Graph from Web-Board Documents

by Chaveevan Pechsiri 1,* and Rapepun Piriyakul 2
1 College of Innovative Technology and Engineering, Dhurakij Pundit University, Bangkok 10210, Thailand
2 Department of Computer Science, Ramkhamhaeng University, Bangkok 10240, Thailand
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6615; https://doi.org/10.3390/app12136615
Submission received: 27 May 2022 / Revised: 16 June 2022 / Accepted: 22 June 2022 / Published: 29 June 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The research aim is to construct a disease-symptom knowledge graph (DSKG) as a cause-effect knowledge graph containing disease-symptom relations as a cause-effect relation type determined from downloaded documents on medical web-board resources. Each disease-symptom relation connects a disease-name concept node (a causative-concept node) to a corresponding node having a group of correlated symptom-concept/effect-concept features as common symptom-concept/effect-concept features among some disease-name concepts. The DSKG benefits non-professionals in preliminary diagnosis through a recommender web-board. There are three main problems: how to determine symptom concepts from sentences without annotation on the documents having disease-name concepts as the documents’ topic-names; how to determine the disease-symptom relations from the documents with/without complications; and how to construct the DSKG involving high dimensional symptom-concept features after union of the correlated symptom-concept groups. Therefore, we apply a word co-occurrence pattern including medical-symptom expressions from Wikipedia including MeSH and the Lexitron Dictionary to determine the symptom concepts. The Cartesian product is applied for automatic-supervised machine learning to determine the disease-symptom relation. We propose using Principal Component Analysis for constructing the DSKG by dimensionality reduction in the symptom-concept features with minimized information loss. In contrast to previous works, the proposed approach enables the DSKG construction with precise and concise representation scores of 7.8 and 9, respectively.

1. Introduction

The research objective is to construct a disease-symptom knowledge graph (DSKG) automatically from disease-symptom relations determined on documents downloaded from two medical web-board resources: a Thai-hospital web-board resource (e.g., http://www.si.mahidol.ac.th/sidoctor/e-pl/ (accessed on 15 February 2021); https://www.bumrungrad.com (accessed on 15 February 2021); etc.) and a Thai-Healthcare-Knowledge web-board resource (e.g., http://haamor.com/ (accessed on 15 February 2021); and https://www.doctor.or.th/doctorme/general (accessed on 15 February 2021)). The DSKG is formed as a cause-effect knowledge graph that represents disease-symptom relations, which are a cause-effect relation type between causative concept nodes and effect concept nodes, where each causative concept node is represented by a certain disease-name concept, and each effect concept node is represented by a correlated symptom-concept group. According to [1], most patients with a certain disease have multiple symptoms rather than a single symptom, and some of these multiple symptoms are either correlated or co-occur as common symptoms among some diseases; e.g., stuffy nose, runny nose, and cough are common symptoms among Cold, Flu, Airborne Allergy, and COVID-19 (https://newsinhealth.nih.gov/2022/01/it-flu-covid-19-allergies-or-cold (accessed on 15 February 2021)). The DSKG presents several disease-symptom relations determined or extracted from the downloaded documents, where each disease-symptom relation is a link connecting a disease-name concept node to a node containing a symptom-concept group as an occurrence of multiple symptoms.
Furthermore, this disease-symptom relation is a semantic relation, particularly the cause-effect relation type (called CErel), which links each disease-name concept (di; i = 1, 2, …, numofDiseaseElements) as the causative concept to the corresponding symptom-concept group (SymGroupdi) as the effect-concept group resulting from di on the di document (which is the downloaded disease document having di as the document topic name), where di ∈ DS, and DS is a disease-name concept set; SymGroupdi ⊂ SG, and SG is a set of symptom-concept groups resulting from the corresponding DS elements. Moreover, the symptom concepts of our research include sign concepts and are the basic symptoms of each di. In general, a semantic relation is a directional link between two or more concepts, entities, or sets of entities that participate in the relation [2], as follows:
<Concept1>---(Relation)---<Concept2>
where “<…>” and “(…)” symbols represent a concept and a relation type, respectively.
Thus, CErel is the relation type which links Concept1, e.g., di, to Concept2, e.g., SymGroupdi, as follows:
<di>---(CErel)----> <SymGroupdi>.
where SymGroupdi = {Symi1, Symi2, …, Symi last_i}of_di; SYM is the universal symptom-concept set, and Symij ∈ SYM; i = 1, 2, …, numofDiseaseElements; j = 1, 2, …, last_i; SymGroupd1 ∪ SymGroupd2 ∪ … ∪ SymGroupd numofDiseaseElements ⊂ SYM; and SymGroupdi ∩ SymGroupdl may be either empty or non-empty, where i ≠ l and 1 ≤ l ≤ numofDiseaseElements. Moreover, the concepts and the relations are the foundation of knowledge and thought [2], where the concepts are the building blocks of knowledge, and the relations are the cement linking the concepts into knowledge structures. According to the knowledge structure, the DSKG is formed by the CErel connections which connect several different <di> nodes to a node containing several correlated sc features (where sc is a symptom concept expressed on the documents; sc ∈ S; S is obtained by the union of all subsets of SG, i.e., all SymGroupdi from all di documents; i = 1, 2, …, numofDiseaseElements; S ⊂ SYM; S = {sc}; and c is an index, c = 1, 2, ..., m, where m is the number of symptom-concept features; see Figure 1).
With regard to Figure 1, the DSKG also presents a <s1, s8, …, sβ> node as the common symptom-concept features among di nodes.
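The structure described above, in which each disease-name concept links to a symptom-concept group and some symptom concepts are shared among diseases, can be illustrated with plain sets. This is only a minimal sketch: the disease names, symptom labels, and function names below are hypothetical toy values and are not taken from the corpus.

```python
# Hypothetical toy data: disease-name concept d_i -> symptom-concept group SymGroup_di.
sym_groups = {
    "cold": {"runny_nose", "cough", "sore_throat"},
    "flu": {"runny_nose", "cough", "fever"},
    "allergy": {"runny_nose", "itchy_eyes"},
}


def universal_symptoms(groups):
    """Union of all symptom-concept groups (the set S in the text)."""
    s = set()
    for group in groups.values():
        s |= group
    return s


def common_symptoms(groups, diseases):
    """Symptom concepts shared by every disease listed in `diseases`."""
    return set.intersection(*(groups[d] for d in diseases))


def cerel_edges(groups):
    """CErel links as (disease, symptom-concept group) pairs."""
    return [(d, frozenset(g)) for d, g in sorted(groups.items())]


S = universal_symptoms(sym_groups)
shared = common_symptoms(sym_groups, ["cold", "flu"])
```

Here `shared` plays the role of a common symptom-concept node connected to both the "cold" and "flu" nodes, while the intersection of two groups may equally be empty.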
In addition, identifying the symptoms in terms of the symptom concepts is vital to the diagnosis of diseases in the medical field. Approximately 70–90% of the diagnostic information is comprised of a patient’s history and physical examinations that involve costly physical tests [1]. Although about one third of the identifiable common symptom concepts do not provide a conclusive disease-based explanation [1], the DSKG of our research can be used as an additional healthcare procedure for preliminary diagnosis of some potential diseases during the diagnostic processes, which potentially reduces physical examination costs. Moreover, the DSKG will help healthcare practitioners to avoid tunnel vision and maintain awareness of the presence of multiple symptom concepts. The DSKG can also be extended beyond healthcare practitioners to non-professionals for preliminary diagnosis of possible diseases from actual symptom occurrences, as a web application system containing a healthcare recommender application using the DSKG on their mobile phones or computers.
Thus, the research focuses on constructing the DSKG from the determined cause-effect pairs, i.e., the di-SymGroupdi pairs, having CErel from the downloaded disease documents on the Thai medical web-board resources. The Symij element expressed on the di document is mostly based on an event expression on an Elementary Discourse Unit (EDU, which is defined as a simple sentence or a clause by [3]). In addition, the event expression is explained by a verb with the event semantic [4] on the EDU’s verb phrase, where each EDU expression is based on a general linguistic expression, e.g., a general Thai linguistic expression (see Figure 2), after word stemming and stop-word elimination.
In Figure 2, a concept of each element (called “an element concept”) in the Verbweak, Verbstrong, Adv, Adj, and Noun sets is based on the medical-symptom-expression list of Wikipedia (https://en.wikipedia.org/wiki/List_of_medical_symptoms (accessed on 10 January 2022)) and MeSH (https://www.ncbi.nlm.nih.gov/mesh (accessed on 15 February 2021)), after translating from English to Thai by the Lexitron Dictionary (https://dict.longdo.com (accessed on 15 February 2021)) followed by the Thai to English translation by the Lexitron Dictionary and WordNet [5] (https://word-net.princeton.edu/obtain (accessed on 15 February 2021)). An example of a downloaded disease document is shown in Figure 3, which contains seven different symptom-concept expressions based on verb phrases of EDU2-EDU7 and EDU11.
Several techniques in the literature [6,7,8,9,10,11,12,13,14,15] have been applied for determining the cause-effect/disease-symptom relation from unstructured data, e.g., texts, without constructing the cause-effect/disease-symptom graph or network, except [7,12,13,14,15] (see Section 2), where each cause-effect/disease-symptom relation in the graph or network is based on a causative-concept feature, e.g., a disease-name concept feature, connecting to one effect-concept feature, e.g., a symptom-concept feature. In contrast, our DSKG is constructed by several CErel connections where each CErel connection is the link between one causative-concept feature, e.g., a di feature, and a group of correlated effect-concept features (e.g., a group of the sn1, sn2, …, and snη features which are the correlated sc features as the common symptom-concept features), where n1, n2, …, and nη are the index (c) values; 0 < n1 < n2 < … < nη; η is numberOfCorrelatedSymtomConcepts; η ≤ m; and m is the number of symptom-concept features (see Figure 1).
In [12,13,14,15], determination of the CErel or the disease-symptom relation from the documents involves a disease-name concept feature set and a symptom-concept feature set where the symptom-concept feature elements are mostly expressed by at least one term/word on NP1 or NP2 of a simple sentence (see Section 2). Another study [16] works only on symptom recognition from documents, without determining the disease-symptom relation, where the symptoms are based on either NP1 or VP. In contrast, the symptom-concept features of our DSKG construction are mostly expressed by at least two terms/words as a composite variable [17] on an EDU’s verb phrase (VP) including NP1 of the EDU after word stemming and stop-word removal, e.g., “(คอ/throat)/NP1 ((เจ็บ/hurt)/Verbstrong)/VP” (a sore throat), “(ศรีษะ/head)/NP1 ((มี/has)/Verbweak (อาการ/symptom)/Noun (บวม/swelling)/NP2)/VP” (a swollen head), and “(คนไข้/patient)/NP1 ((ปวด/pain)/Verbstrong (ศรีษะ/head)/NP2)/VP” (The patient has a headache). A composite variable combines two or more individual variables, called indicators, into a single variable [17]. Each indicator alone does not provide sufficient information, but together they can represent a more complex concept. The composite variable of the symptom-concept feature of our research consists of one or two terms from the EDU’s VP and one term from the EDU’s NP1 for obtaining a symptom/effect concept.
However, Thai documents have some specific characteristics, such as zero anaphora or implicit noun phrases, and a lack of word and sentence delimiters. These characteristics are involved in three main problems for constructing the DSKG from the documents: (1) how to determine a symptom concept of an EDUh occurrence as EDUh,Symij (which is an EDUh occurrence with a symptom concept, Symij, on a di document, h = 1, 2, …, endDocument_di) among several EDUh occurrences with non-symptom concepts on the di document; (2) how to determine CErel between di and each SymGroupdi element (Symij of di) from a downloaded di document for subsequently determining a di-SymGroupdi pair with CErel, where SymGroupdi appears as a symptom-concept EDU vector, i.e., 〈EDUh1,Symi1, EDUh2,Symi2, ..., EDUhrim,Symi last_i〉of_di, on the di document (0 < h1 < h2 < … < hrim ≤ endDocument_di), whilst some downloaded di documents contain di symptom concepts of the primary disease with/without other symptom concepts resulting from complications as the secondary disease; and (3) how to construct the DSKG based on each <di> node (which is represented by the di feature), connecting to the <sn1, sn2, …, snη> node (which is represented by a group of the correlated sc features) with the problem of the high dimensional feature space of sc, where sc ∈ S and S is obtained by the union of all SymGroupdi from the determined di-SymGroupdi pairs having CErel. S then has a high dimensional feature space [18], which makes finding the correlated sc features for the DSKG construction time consuming.
We then need to develop a framework which combines statistical techniques, machine learning techniques, and linguistic phenomena to learn several EDU expressions for solving the research problems. With regard to Figure 3, most of the symptom-concept occurrences on the documents are based on Verb or VP (see Figure 2). Therefore, we apply a word co-occurrence (called “wc”) pattern on an EDU occurrence, where a wc pattern consists of three major term sets on an EDU occurrence: a predicate-verb term set, an agent term set, and a patient/information term set (see Section 3.1). The word co-occurrence [19,20] is expressed as a compound term with/without any pattern or restriction depending on each research perspective, whilst the wc pattern of our research is used for obtaining an EDU’s wc expression as a composite variable along with determining a wc concept, particularly a symptom concept, which is called “wcSym” of a wc expression on an EDU occurrence. Thus, the symptom-concept feature, Symij, occurring on the EDUh,Symij occurrence of the di document, is represented by wcSymij of the EDUh,Symij’s wc expression without the concept annotation. With regard to the di document, wcSymij and di are then used as a wcSymij feature and a di feature, respectively, for determining a di-SymGroupdi pair with CErel where the concepts of wcSymij and di are based on the Verbweak, Verbstrong, Adv, Adj, and Noun sets (in Figure 2) prepared and collected from the medical-symptom-expression list of Wikipedia and MeSH after the English-Thai translations. Accordingly, this paper makes three contributions based on a statistical approach involving machine learning.
The first is how to determine a SymGroupdi element, Symij, based on the wc pattern used for determining the wcSymij feature as the composite variable without the concept annotation, whereas the symptom concepts in previous research, e.g., [13,15], are based on only NP1/NP2, whilst NP1 is likely an ellipsis (or NP1 has a null value) in our documents. The second is how to determine di-wcSymij pairs having CErel by machine learning techniques with automatic-supervised learning, where the positive/negative instances are formed by the Cartesian product of DS × SG [21] with consideration of the disease type (t or Typet); see Section 3.2. According to the DSKG, each wcSymt,ij feature is based on the basic symptom of dt,i where dt,i is di in Typet, and wcSymt,ij is wcSymij of dt,i. Thus, the symptoms of the complications are excluded by the complicationTerm set, {‘ภาวะแทรกซ้อน/complications’, ‘ไม่รักษา/unTreat’, ...}. The third is how to construct the DSKG by clustering or wrapping the determined S elements to become the group of the correlated sc features as the dimensionality reduction of the feature space of S with minimized information loss. Thus, the constructed DSKG diagram also presents the <sn1, sn2, …, snη> node as the multi-symptom concept node among some <di> nodes.
Therefore, we apply the wc pattern to obtain the wc expression of EDUh,Symij, where the wc pattern relies upon a predicate-argument pattern [22] (see Section 3.1). The wcSymij feature is determined from the wc expression by the elements in the Verbweak, Verbstrong, Adv, Adj, and Noun sets collected from the medical-symptom-expression list of Wikipedia including MeSH through the English-Thai translation without the concept annotation (see Section 3.1 and ii in Section 4.2). Moreover, all symptoms of the complications are excluded if they occur right after a complicationTerm set element. We also apply the machine learning techniques Support Vector Machine (SVM) [23], Naïve Bayes (NB) [24], and Linear Logistic Regression (LR) [25] to determine dt,i-wcSymt,ij pairs having CErel by the automatic-supervised learning from the result of the Cartesian product of DS × SG aligned with the disease type (t or Typet) on the learning corpus (see Section 3.2). Each dt,i-SymGroupdt,i pair with CErel of Typet is subsequently obtained by grouping the determined dt,i-wcSymt,ij pairs having CErel with the same dt,i from the test corpus. We then propose using principal component analysis (PCA) [26] to solve the high dimensional feature space of S by wrapping the S elements to become the group of the correlated sc features as the common features for constructing the DSKG (see Section 3.3).
Our research is organized into six sections. In Section 2, related works are summarized. Research problems in constructing the DSKG from the documents are described in Section 3, and Section 4 shows our framework for constructing the DSKG through CErel determination from the documents. In Section 5, we evaluate and discuss our proposed methodology and then present the conclusion in Section 6.

2. Related Works

Several strategies [6,7,8,9,10,11,12,13,14,15] have been proposed to determine the cause-effect/disease-symptom relation from documents as unstructured data without addressing the cause-effect/disease-symptom knowledge graph construction, except [7,12,13,14,15]. Girju [6] determined a causal relation based on a lexico-syntactic pattern (NP1 causal-verb NP2) by decision tree learning. Thus, the cause/disease and effect/symptom occurrences are based on noun phrases as NP1 and NP2, respectively, within one sentence. In contrast, the causal relation based on one complex sentence [7] was determined/extracted by using a cue-phrase set (where a cue phrase was a word, a phrase, or a word pattern) for connecting an NP pair as a cause and an effect, including probabilities. The extracted causal relations [7] were used for constructing the causal network as the knowledge graph for the term protein with two relations, the causal relation and the hypernym relation, without addressing the high dimensional effect feature sets. Moreover, Riaz and Girju [8] used a set of linguistic features and Integer Linear Programming to learn a causal relation within one sentence from annotated verbeffect-nouncause pairs on verb phrases as the causal relation based on the expert annotations and FrameNet including WordNet to generate a training corpus. For example: “A campaign has started to try to cut the rising number of children dying [cause from solvent abuse].” where the pair “dying_solvent abuse” encodes causality by the annotation. Reference [9] applied the Restricted Hidden Naïve Bayes model to the lexico-syntactic pattern (NP1 ConnectiveVerb NP2 where NP1 and NP2 are noun phrases as a cause and an effect, respectively, or vice versa) of each sentence to learn the classes, annotated by experts on 26 feature templates categorized into four feature groups: the contextual, syntactic, positional, and connective features.
They extracted/determined the causality with a precision of 0.873 and a recall of 0.841 from English documents. Reference [10] also extracted/determined the causal relation within one sentence (where causes and effects were based on noun expressions) by using linguistic rules along with Bayesian inference to reduce the number of pairs produced by ambiguous patterns, whilst [11] used syntactic patterns by manual annotation within one sentence or between two sentences having a cause-effect link/relation. The cause-effect links were extracted or determined by a syntactic pattern-based algorithm from scientific papers with 47% and 70% average precision and recall, respectively. The determined cause-effect links were applied to represent the core of scientific papers as a summarization. Reference [12] extracted disease-symptom relationships from texts by using syntactic patterns chosen by a quality- and specificity-based selection from several determined syntactic patterns (where each syntactic pattern was determined on each dependency graph of a sentence containing both a disease entity and a symptom entity based on a noun term or a noun phrase). Reference [13] automatically constructed the health knowledge graphs of a disease-symptom connection by using logistic regression, naive Bayes, and a Bayesian network using noisy OR gates to learn and determine the connection or relation between the disease codes and the symptom concepts from textual data of patient medical records with a 0.85 precision, a 0.6 recall, and a 0.704 F1-score, with the positive/negative classes based on the actual symptom occurrence on the textual data. However, the symptom expressions on the textual data were based on noun phrases, e.g., “(The patient)/NP1 (complains of (a worsening cough)/NP2)/VP” and “(He)/NP1 (also has (a dry cough)/NP2)/VP”.
Reference [14] determined each disease-symptom relation and also symptom names within one sentence by learning of a multi-column convolutional neural network (MCNN) based on a human annotator from 50,000 random sentences of the Japanese web texts. MCNN had an input sentence which was divided into five consecutive word sequences: a symptom candidate name (SYMname), a given disease name (DISname), the word sequence before the SYMname, the word sequence between the SYMname and the DISname, and the word sequence after the DISname. Their proposed symptom name extraction method achieved a 93.8% F1-score, and the disease–symptom relationship extraction method achieved an 88.3% F1-score where the extracted symptom names were based on noun or noun-phrase expressions. Reference [15] used the PubMed bibliographic literature database and the association between symptoms and diseases in the MeSH metadata fields of PubMed to determine the disease-symptom relationships where symptoms and diseases were based on noun phrases. They applied the term frequency-inverse document frequency to measure the strength of an association between symptomi and diseasej for constructing the disease-symptom network without concerning the high dimensional symptom features. Moreover, Ref. [16] recognized only the medical symptom expressions on patient texts without determining the disease-symptom relation. They applied the sentence/phrase templates based on either a noun phrase or a verb phrase including the symptom concepts labeled by the experts to capture the surface of symptom expressions from the patient text. The machine learning techniques were applied for the multi-label classification of symptoms including the long tail symptoms from the surface of the symptom expressions. The [16] model achieved a 76% F1-score.
However, the causative-concept and effect-concept features of the previous works [6,7,8,9,10,11,12,13,14,15] are based on at least one word of either the verb term expression or the noun-term/noun-phrase expression without the composite variable consideration. The cause-effect relation/association determinations of [6,7,8,9,10,11,12,13,14,15] except [13] are based on machine learning with expert or human annotation, whereas CErel of our research is based on automatic supervised learning. According to the few research works [7,12,13,14,15] on the cause-effect graph/network construction from unstructured data, i.e., texts, their cause-effect graphs/networks are based on each CErel connection between the node of the causative-concept feature and the node of the effect-concept feature without addressing the high dimensional data on the feature set, particularly the effect-concept feature set, even though their corpora are large. However, our DSKG construction (which is based on the CErel connection between the <di> node and the <sn1, sn2, …, snη> node containing the group of the correlated sc features as an occurrence of multiple symptoms) involves the high dimensional feature problem.

3. Problems of DSKG Construction

There are three main problems that must be solved: how to determine the wcSymij features on the EDUh,Symij occurrences without concept annotation on the di documents, how to determine the di-wcSymij pairs having CErel with automatic supervised learning, and how to construct the DSKG based on the correlated sc features with the high dimensional feature problem.

3.1. How to Determine wcSymij Features on EDUh,Symij Occurrences without Concept Annotation on di Documents

According to the corpus behavior study of the health-care domain, most of the symptom-concepts on the EDUh,Symij occurrences are the event or state expressions by the verb phrases where each verb phrase contains a predicate verb/a predicate-verb term (va; a = 1, 2, ..., numberOfpredicate) which is used to identify the EDUh,Symij expressions among EDUh expressions as shown in the following (a)–(d) examples (other than Figure 3) with another problem of having the same concept with different verb phrase expressions resulting in the different wc expressions. These examples also include the phonetic expression by http://translate.google.com/ (accessed on 15 February 2021).
Example 1:
(a) EDU: “[คนไข้] ปวดกล้ามเนื้อ” ([A patient] has a muscle pain.)
“([(คนไข้/Khnk̄hị̂)/patient])/NP1 (((ปวด/pwd)/pain)/Verbstrong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun)/VP”
(b) EDU: “ผู้ป่วยมีอาการปวดกล้ามเนื้อเล็กน้อย” (A patient has a symptom of mild muscle pain.)
“((ผู้ป่วย/P̄hū̂ p̀wy)/patient)/NP1 (((มี/mī)/has)/Verbweak ((อาการ/xākār)/symptom)/Noun ((ปวด/pwd)/pain)/Verbstrong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun ((เล็กน้อย/lĕkn̂xy)/mild)/Adj)/VP”
(c) EDU: “ผู้สูงอายุมีอาการเหนื่อย” (An elder has a tired symptom.)
“((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/NP1 (((มี/mī)/has)/Verbweak ((อาการ/xākār)/symptom)/Noun ((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verbstrong)/VP”
(d) EDU: “[ผู้ป่วย] เหนื่อยมาก” ([A patient] is very tired.)
“([(ผู้ป่วย/P̄hū̂ p̀wy)/patient])/NP1 (((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verbstrong ((มาก/māk)/very)/Adv)/VP”
where the (a) and (b) examples, and likewise the (c) and (d) examples, have different verb phrase expressions with the same major symptom concepts of “(ปวดกล้ามเนื้อ/pwd kl̂ām neụ̄̂x)/pains in muscle” and “(เหนื่อย/h̄enụ̄̀xy)/be tired”, respectively; and the [..] symbol means ellipsis.
With regard to [22], the predicate-argument pattern is verb(agent_argument, patient/information_argument), where verb is an element of a predicate-verb term set; agent_argument is an element of an agent term set; and patient/information_argument is an element of a patient/information term set. According to the predicate-argument pattern, we then apply the following wc pattern on each EDU occurrence based on Figure 2 to obtain the wc expressions for automatically determining the wcSymij features after the EDUh,Symij identification from the di documents by the predicate-verb term set.
wc pattern: V + W1 + W2
where:
V is a predicate-verb term set; V = Verbstrong ∪ Vinf; va ∈ V. Since vweak,b has a weak symptom concept, winf,c as an information word is added to vweak,b to become a strong symptom concept which is an element of Vinf, i.e., (vweak,b + winf,c) ∈ Vinf, where vweak,b ∈ Verbweak; winf,c ∈ (Noun ∪ Adj ∪ Verbstrong); winf,c is a word right after vweak,b; and a, b, and c are integer indices.
W1 is an agent term set; w1,g ∈ W1; w1,g is a head noun or a Noun element of NP1; and g is an integer index.
W2 is a linguistic patient/information term set; w2,e ∈ W2; W2 = Noun ∪ Adj ∪ Adv; w2,e is a word sequence right after va; w2,e has a null value if it does not exist; and e is an integer index.
Moreover, the concept elements of Verbstrong, Verbweak, Vinf, Noun, Adj, and Adv sets of Figure 2 are firstly prepared and collected from the results of the Thai-word and Thai-EDU segmentations on the translated terms (English to Thai by the Lexitron Dictionary) of the medical-symptom-expression list on the Wikipedia and MeSH web sites. In addition, the segmented Thai words are then translated from Thai to English by the Lexitron Dictionary and WordNet for collecting a concept of each element in the Verbstrong, Verbweak, Vinf, Noun, Adj, and Adv sets on Figure 2. If the segmented Thai word has several English word senses, the expert will select the corresponding symptom concept in English (see ii in Section 4.2).
According to each corpus of our research, the wcSymij feature (which presents in the form of the predicate-argument pattern) is then determined from each term of the wc expression including the Thai-to-English translation by the collected element concepts of the Verbstrong, Verbweak, Vinf, Noun, Adj, and Adv sets. For the approach of the wc expression for the wcSymij feature determination, the w1,g ellipsis has to be solved by the following rule: w1,g is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis. In addition to the problem of the different verb phrase expressions with the same symptom concept, we apply another rule to obtain wc expression with the actual symptom expression: if “vweak,b + winf,c” is “มี/have + อาการ/symptom”, we take the next two words right after “vweak,b + winf,c” to be va and w2,e as the actual symptom expression instead as shown in the following from the previous (b) and (c) in Example 1 including each term translation on the wc expression from Thai to English by using the collected element concepts in the Verbstrong, Verbweak, Vinf, Noun, Adj, and Adv sets.
(b) V + W1 + W2 = ((ปวด/pwd)/pain)/Verbstrong + ([(ผู้ป่วย/P̄hū̂p̀wy)/patient])/Noun + ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun
(c) V + W1 + W2 = ((เหนื่อย/h̄enụ̄̀xy)/beTired)/Verbstrong + ((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/Noun + null
Moreover, some general-concept rules are applied to the acquired wc expressions for the general wcSymij feature presentation: the w1,g concept is “person” if w1,g is in the Person set, {‘คนไข้,ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, …}. In addition, if the term in w2,e is a concept of an element in the Symptom-Expression-Level set, {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}, w2,e is given a null value to obtain a general symptom concept. Thus, the wcSymij features of the EDUs’ wc expressions of (a)–(d) in Example 1 are represented by the predicate-argument pattern as follows: (a) pain([person],muscle), (b) pain(person,muscle), (c) beTired(person), (d) beTired([person]), where the (a) and (b) examples, and likewise the (c) and (d) examples, have different EDU verb-phrase expressions but the same symptom concepts of pain(person,muscle) and beTired(person), respectively.
Therefore, after each EDUh,Symij occurrence on the di document has been identified by the predicate-verb term (va) followed by the w1,g, and w2,e of the wc pattern, the wcSymij feature is automatically determined by translating all wc expression terms (va, w1,g, w2,e) from Thai to English by the collected concept elements of the Verbstrong, Verbweak, Vinf, Noun, Adj, and Adv sets.
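The rules of this subsection (the V + W1 + W2 pattern, the NP1 ellipsis rule, the “มี/have + อาการ/symptom” rule, and the two general-concept rules) can be sketched in a few lines of Python. The sketch assumes that word segmentation and tagging have already been done; the small `CONCEPT` lexicon, the underscore used to join a Vinf element, and all function names are hypothetical stand-ins for the element concepts collected from Wikipedia and MeSH. The bracket notation for an ellipsis is not reproduced in the output.

```python
# Hypothetical concept lexicon (Thai term -> English concept).
CONCEPT = {
    "ปวด": "pain", "กล้ามเนื้อ": "muscle", "เหนื่อย": "beTired",
    "มี": "have", "อาการ": "symptom", "มาก": "very", "เล็กน้อย": "little",
    "ผู้ป่วย": "patient", "คนไข้": "patient", "ผู้สูงอายุ": "elder",
}
PERSON = {"patient", "elder", "infant", "child", "teenager"}
LEVEL = {"very", "little", "moderately"}      # Symptom-Expression-Level concepts
INFO_TAGS = {"Noun", "Adj", "Verbstrong"}     # tags allowed for w_inf
W2_TAGS = {"Noun", "Adj", "Adv"}              # tags allowed for w2


def concept(word):
    """Look up the English concept; fall back to the word itself."""
    return CONCEPT.get(word, word)


def wc_sym(np1, vp, prev_np1=None):
    """Return a predicate-argument string for one EDU, or None when the
    verb phrase yields no predicate-verb term.

    np1      -- head noun of NP1 (Thai), or None when NP1 is an ellipsis
    vp       -- list of (Thai word, tag) pairs of the verb phrase
    prev_np1 -- head noun of the previous EDU's NP1 (ellipsis rule)
    """
    w1 = np1 if np1 is not None else prev_np1
    words = list(vp)
    # Rule: skip "have + symptom" and read the actual symptom after it.
    if len(words) >= 2 and words[0][0] == "มี" and words[1][0] == "อาการ":
        words = words[2:]
    if not words:
        return None
    word, tag = words[0]
    rest = words[1:]
    if tag == "Verbstrong":
        verb = concept(word)
    elif tag == "Verbweak" and rest and rest[0][1] in INFO_TAGS:
        # v_weak + w_inf forms a V_inf element (joined here with "_").
        verb = concept(word) + "_" + concept(rest[0][0])
        rest = rest[1:]
    else:
        return None
    w2 = concept(rest[0][0]) if rest and rest[0][1] in W2_TAGS else None
    # General-concept rules: person agents and level words are generalized.
    agent = concept(w1) if w1 else None
    if agent in PERSON:
        agent = "person"
    if w2 in LEVEL:
        w2 = None
    args = [a for a in (agent, w2) if a is not None]
    return f"{verb}({','.join(args)})"
```

Applied to EDU (b) of Example 1, i.e. `wc_sym("ผู้ป่วย", [("มี", "Verbweak"), ("อาการ", "Noun"), ("ปวด", "Verbstrong"), ("กล้ามเนื้อ", "Noun"), ("เล็กน้อย", "Adj")])`, the sketch yields the same predicate-argument form as EDU (a), which is the intended effect of the rules.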

3.2. How to Determine di-wcSymij Pairs Having CErel with Automatic Supervised Learning

We apply SVM, NB, and LR to learn the di-wcSymij pairs having CErel by automatic-supervised learning from the learning corpus, where the positive and negative instances (the CErel and nonCErel classes, respectively) are assigned by the Cartesian product of DS × SG aligned with the disease types. Thus, the downloaded disease documents are separated into two groups according to two disease types, an infectious disease type (Typet = Type1) and a non-infectious disease type (Typet = Type2). Each disease type contains the dt,i-SymGroupdt,i pairs having the CErel connections, which link the dt,i features to the corresponding SymGroupdt,i features determined from the dt,i documents (where dt,i is di in Typet; SymGroupdt,i is the SymGroupdi obtained for dt,i from the dt,i document; and wcSymt,ij is wcSymij of dt,i). Thus, Symt,ij (which is a Symij element in Typet) is a SymGroupdt,i element represented by wcSymt,ij as follows:
SymGroupdt,i = {wcSymt,i1, wcSymt,i2, …, wcSymt,ilast_i}of_dt,i
where i = 1, 2, ..., numofDiseaseElementst; j = 1, 2, ..., last_i.
Moreover, some downloaded dt,i documents contain both the dt,i symptom expressions and the symptom expressions of the dt,i complications. Therefore, if an element of the complicationTerm set is identified in the dt,i document, all symptoms, wcSymt,ij, occurring right after that element are excluded.
The positive/negative instances of the dt,i wcSymt,ij pairs from each dt,i-SymGroupdt,i pair are formed from the Cartesian product of DS × SG as follows: each dt,i wcSymt,ij pair of a given dt,i-SymGroupdt,i pair is a positive instance having CErel (the positive class) if dt,i and SymGroupdt,i have the same Typet. Meanwhile, each dt,i wcSymt,ij pair of a given dt,i-SymGroupdt,i pair is a negative instance having nonCErel (the negative class), based on Typet of the dt,i document, if dt,i and SymGroupdt,i have different Typet (see Figure 4).
DS = {Dt}
where Dt is a disease name set in Typet; t = 1, 2;
Dt = {dt,i}
where i = 1, 2, ..., numofDiseaseElementst;
SG = {SymGrpt}
SymGrpt = {SymGroupdt,i}
where SymGroupdt,i = {wcSymt,i1, wcSymt,i2, …, wcSymt,ilast_i}of_dt,i.
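The automatic labeling by the Cartesian product of DS × SG can be sketched in Python as follows. The disease names and symptom features are toy data, and the function name `label_instances` is hypothetical. The sketch rests on one interpretation that should be read as an assumption: within the same disease type, a disease is paired with the symptoms of its own SymGroup (positive instances), and across disease types it is paired with the symptoms of every SymGroup of the other type (negative instances).

```python
from itertools import product

# Toy data (hypothetical): D_t per disease type and SymGroup per disease.
D = {1: ["influenza"], 2: ["diabetes"]}
sym_group = {
    "influenza": ["haveFever(person)", "cough(person)"],
    "diabetes": ["beThirsty(person)"],
}

def label_instances(D, sym_group):
    """Return (disease, symptom, class) triples from DS x SG."""
    instances = []
    # Iterate over every (type of D_t, type of SymGrp_t) combination.
    for t_d, t_s in product(D, repeat=2):
        for d in D[t_d]:
            if t_d == t_s:
                # Same type: the disease with its own symptom group (assumption).
                instances += [(d, s, "CErel") for s in sym_group[d]]
            else:
                # Different types: symptoms of diseases of the other type.
                for other in D[t_s]:
                    instances += [(d, s, "nonCErel") for s in sym_group[other]]
    return instances

for instance in label_instances(D, sym_group):
    print(instance)
```

No manual annotation of the relation class is needed in this scheme, because the class follows from the disease types alone.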
After the automatic-supervised learning by NB, SVM, and LR on each disease type of the learning corpus, we determine the dt,i wcSymt,ij pairs having CErel in the dt,i documents of each disease type from the test corpus. The dt,i-SymGroupdt,i pair with CErel is then obtained by grouping all determined dt,i wcSymt,ij pairs having CErel by the same dt,i from the dt,i document. All dt,i-SymGroupdt,i pairs having CErel are then used for constructing the DSKG.

3.3. How to Construct DSKG with the Problem of High Dimensionality of Symptom-Concept Features

After determining the dt,i-SymGroupdt,i pairs having CErel from all downloaded dt,i documents, the union of all SymGroupdt,i yields many different sc features, which form a high dimensional feature space S (the high dimensional symptom concepts) from which the DSKG is built over several different disease-name concepts (dt,i). The high dimensionality makes it difficult to find the relevant disease symptom concepts for constructing a concise DSKG. Therefore, we propose using PCA to solve the high dimensionality problem. Following the description of PCA in [27], a variance and a covariance are defined as follows:
The variance of a random variable is defined as
$$\operatorname{var}(X_1) = E(X_1^2) - (E(X_1))^2 \qquad (6)$$
where (6) shows how variance measures the average deviation from the mean value. When we have more than one random variable, it is useful to analyze the covariance:
$$\operatorname{cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) \qquad (7)$$
If the covariance is zero, which is equivalent to saying that the correlation coefficient is zero, the variables are said to be uncorrelated. The variances and covariances of the elements of a random vector x are often connected to a covariance matrix (C(x)) whose a,b-th element is simply the covariance of Xa and Xb:
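$$C(x) = \begin{pmatrix} \operatorname{cov}(X_1, X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_m) \\ \operatorname{cov}(X_2, X_1) & \operatorname{cov}(X_2, X_2) & \cdots & \operatorname{cov}(X_2, X_m) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(X_m, X_1) & \operatorname{cov}(X_m, X_2) & \cdots & \operatorname{cov}(X_m, X_m) \end{pmatrix} \qquad (8)$$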
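The variance, covariance, and covariance-matrix definitions above can be checked numerically with a small pure-Python sketch. The three feature columns are toy indicator values (one entry per document), the expectation E is taken as the arithmetic mean over those entries, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cov(xa, xb):
    """Covariance as E(Xa*Xb) - E(Xa)*E(Xb); cov(x, x) is the variance."""
    return mean([a * b for a, b in zip(xa, xb)]) - mean(xa) * mean(xb)

def covariance_matrix(features):
    """Matrix whose (a, b) element is cov(Xa, Xb)."""
    return [[cov(xa, xb) for xb in features] for xa in features]

# Toy symptom-concept features observed over four documents (hypothetical).
s1 = [1, 0, 1, 1]
s2 = [1, 0, 1, 0]
s3 = [0, 1, 0, 0]

C = covariance_matrix([s1, s2, s3])
for row in C:
    print(row)
```

The diagonal of `C` holds the variances of the three features, and the matrix is symmetric because cov(Xa, Xb) = cov(Xb, Xa).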
The diagonal of the covariance matrix gives the variances. The covariance matrix is basically a generalization of variance to random vectors. In our research, the Xa and Xb variables are the sa and sb features of the m × m symptom-concept feature matrix (where m is the number of different symptom-concept features; m > 100). Therefore, 〈s1, s2, …, sm〉 is a symptom-concept feature vector in which sc ∈ S and c = 1, 2, …, m. The symptom-concept feature vector of the symptom-concept feature matrix is then rotated to group its symptom-concept features into the minimum number of separated feature groups, where each separated feature group is called “Fgroupz”; z = 1, 2, …, numofFeatureGroups, and numofFeatureGroups is less than m. After the rotation, a feature loading weight from an eigenvector is determined for each sc feature with respect to Fgroupz. A high feature loading weight of sc on Fgroupz implies that the correlation between sc and Fgroupz is high. The different sc feature elements with high feature loading weights in a certain Fgroupz are wrapped to become a factor (called “Factorz”) with its factor score (called “FactorScorez”), determined by Equation (9) from the feature loading weights of the wrapped sc feature elements.
$$\mathrm{FactorScore}_z = \sum_{c=1}^{k_z} w_c \, \frac{x_c - \bar{x}_c}{SD_c} \qquad (9)$$
where
  • $w_c$ is the feature loading weight of $s_c$ from an eigenvector in Factor$_z$;
    $s_c$ is a symptom-concept (wcSymij) feature element within Factor$_z$;
    $c = 1, 2, \ldots, k_z$;
    $k_z$ is the number of different symptom-concept features in Factor$_z$;
    $x_c$ is the original value of the number of each $s_c$, with its mean, $\bar{x}_c$,
    and standard deviation, $SD_c$.
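The factor score defined above is a loading-weighted sum of standardized feature values. A minimal Python sketch follows; the function name `factor_score` and the numbers are hypothetical, and the means and standard deviations are passed in as inputs because the text does not state how they are estimated.

```python
def factor_score(weights, values, means, sds):
    """Sum of w_c * (x_c - mean_c) / SD_c over the k_z features of a factor."""
    return sum(
        w * (x - m) / sd
        for w, x, m, sd in zip(weights, values, means, sds)
    )

# Toy factor with k_z = 2 wrapped features (hypothetical numbers).
score = factor_score(
    weights=[0.8, 0.5],   # feature loading weights w_c
    values=[3, 1],        # original values x_c for one disease
    means=[1, 1],         # feature means
    sds=[2, 0.5],         # feature standard deviations
)
print(score)
```

Here the first feature lies one standard deviation above its mean and contributes 0.8, while the second feature equals its mean and contributes nothing.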
Therefore, the symptom-concept feature space is reduced from m to numofFeatureGroups, which results in a less time consuming way to find the relevant sc feature elements as the common symptom concepts among the certain di occurrences to construct the DSKG (see Section 4.5).

4. System Overview

There are five steps in our framework: Corpus Preparation, Determination of dt,i and wcSymt,ij Features, Automatic Supervised Learning of dt,i wcSymt,ij Pairs Having CErel, Determination of dt,i wcSymt,ij Pairs Having CErel for Collecting dt,i-SymGroupdt,i Pairs with CErel, and DSKG Construction, as shown in Figure 5.

4.1. Corpus Preparation

This step prepares two disease-symptom corpora from the downloaded disease documents, which have the disease names as the document topic names, on two medical web-board resources: one disease-symptom corpus downloaded from the Thai-Healthcare-Knowledge web-board resource is used as the learning corpus, and the other, downloaded from the Thai-hospital web-board resource, is used as the test corpus. Each disease-symptom corpus consists of the same disease-name concepts from 70 different disease-name documents (the 70 di documents) on each medical web-board resource, where the documents associated with various infections, strokes, kidney diseases, diabetes, and cardiovascular diseases are randomly selected from about 700 and 400 different disease-name documents of the Thai-Healthcare-Knowledge web-board resource and the Thai-hospital web-board resource, respectively. The selection of these diseases is motivated by the rapidly increasing number of patient cases in Thailand, e.g., diabetes [28]. This step applies Thai word segmentation tools [29], including named entity recognition [30,31], to each disease-symptom corpus. After the word segmentation, EDU segmentation is performed [32,33]. The learning corpus and the test corpus then have 12,000 EDUs and 10,000 EDUs, respectively. On each medical web-board resource, each disease-symptom corpus consists of 33 different disease names in Type1 (33 d1,i documents of the infectious disease type) and 37 different disease names in Type2 (37 d2,i documents of the non-infectious disease type). The sample size by random sampling for evaluating the symptom concept determination on the test corpus is about 35 different di documents (covering both disease types), by Equation (10) [34].
$$n = \frac{N}{1 + N E^2} \qquad (10)$$
where
  • $N$ is the population size;
    $E$ is the error; $n$ is the sample size;
    If N = 70 and E 10 % n = 35
.
Moreover, all instances from the results of the Cartesian product of DS × SG on the correct symptom-concept determination and the correct disease-name recognition [30,31] from the learning corpus are used for an automatic-supervised learning of the dt,i wcSymt,ij pairs having CErel within each disease type, Typet: Type1 and Type2, based on tenfold cross validation (see Section 4.3). The learning results are used to determine the dt,i wcSymt,ij pairs having CErel from the test corpus, whilst the correct dt,i wcSymt,ij pairs having CErel are collected into the dt,i-SymGroupdt,i pair with CErel having the same dt,i for the DSKG construction.

4.2. Determination of dt,i and wcSymt,ij Features

The objective of this step is to determine the dt,i and wcSymt,ij features from the learning corpus and the test corpus of each disease type (Type1, Type2) for Automatic-Supervised-Learning of dt,i wcSymt,ij Pairs Having CErel in Section 4.3 and Determination of dt,i wcSymt,ij Pairs having CErel for Collecting dt,i-SymGroupdt,iPairWithCErel in Section 4.4, respectively.
i. Determination of dt,i Features
The disease name feature, dt,i, from the dt,i document having di as the topic name in Typet is determined by the named-entity recognition [30,31] in each disease type of both the learning corpus and the test corpus from the previous step of “Corpus Preparation” (Section 4.1) and then the named-entity translation from Thai to English by using the Lexitron dictionary. The determined dt,i features from the learning corpus are then collected into the Dt set for Automatic-Supervised-Learning in Section 4.3.
ii. Determination of wcSymt,ij Features
  With respect to the wc pattern, it is necessary to prepare and collect the Verbstrong, Verbweak, Noun, Adj, and Adv sets of Figure 2 before determining the wcSymt,ij features as follows:
Preparation and collection of the Verbstrong, Verbweak, Noun, Adj, and Adv sets: Each element concept in the Verbstrong, Verbweak, Noun, Adj, and Adv sets on Figure 2 is prepared and collected from several terms on the medical-symptom-expression list of the Wikipedia and MeSH web sites after translating these terms from English to Thai by the Lexitron dictionary as shown in Table 1.
  From Table 1, some translated terms of the noun expressions on the medical-symptom-expression list (from English to Thai by the Lexitron Dictionary) are presented as EDU or noun-phrase expressions in Thai, for example:
  Example 2. “arrhythmia” = “หัวใจเต้นผิดจังหวะ/H̄ạwcı tên p̄hid cạngh̄wa” as a Thai sentence/EDU expression which is presented in the following with the part of speech after using word and EDU segmentation tools [29,32,33]:
  EDU: ((หัวใจ/H̄ạwci)/Noun)/NP1 ((เต้น/tên)/Verbstrong (ผิดจังหวะ/p̄hid cạngh̄wa)/Adv)/VP
  These segmented words of the EDU segmentation are then translated from Thai to English by the Lexitron dictionary and WordNet for collecting the element concepts in the Verbstrong, Noun, and Adv sets (on Figure 2) as shown in the following:
  ((หัวใจ/H̄ạwci)/heart)/Noun, ((เต้น/tên)/pulse)/Verbstrong, and ((ผิดจังหวะ/p̄hid cạngh̄wa)/irregularly)/Adv,
  Example 3. “palpitation” = “อาการใจสั่น/Xākār Cı s̄ạ̀n” as a Thai noun phrase expression which is presented in the following with the part of speech after using the word segmentation tool [29]:
  ((อาการ/Xākār)/Noun (ใจ/Cı)/Noun (สั่น/s̄ạ̀n)/Verbstrong)/NP
  The results of this NP’s term translation from Thai to English by the Lexitron dictionary and WordNet for collecting the element concepts in the Noun and Verbstrong sets (on Figure 2) are ((อาการ/Xākār)/symptom)/Noun, ((ใจ/Cı)/heart)/Noun, and ((สั่น/s̄ạ̀n)/shake)/Verbstrong.
  According to the Thai to English translation of Example 2 and Example 3, if the segmented Thai word has several English concepts, the expert will select the corresponding English concept for collecting the element concepts.
Determination of wcSymt,ij features from each dt,i document: After stemming words and eliminating stop words of either the learning corpus or the test corpus on each disease type, EDUh,Symt,ij of the dt,i document is identified by the predicate-verb term (va ∈ Verbstrong ∪ Vinf). The wc expression of EDUh,Symt,ij is then obtained by V, W1, and W2 of the wc pattern (see Section 3.1) including the following general rules (R1, R2):
  R1: w1,g is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis (where w1,g ∈ W1).
  R2: if (((vweak,b + winf,c) ∨ (vstrong,f + w2,e)) ∈ Symptom-Cue) ∧ (Symptom-Cue = {‘มี/have + อาการ/symptom’, ‘เป็น/be + อาการ/symptom’, ‘เกิด/occur + อาการ/symptom’}), then we take the next two words right after “vweak,b + winf,c” or “vstrong,f + w2,e” to be the new va + w2,e expression as the symptom expression instead, where vweak,b ∈ Verbweak; winf,c ∈ Noun ∪ Adj ∪ Verbstrong; vstrong,f ∈ Verbstrong; w2,e ∈ W2; va ∈ Verbstrong ∪ Vinf; (vweak,b + winf,c) ∈ Vinf; and a, b, c, e, and f are integer element indices.
  The wcSymt,ij features of the wc expressions are automatically determined by the concept rules R3, R4, and R5 and are represented by the predicate-argument pattern after the wc expressions of the disease-symptom documents are translated from Thai to English using the collected element concepts in the Verbstrong, Verbweak, Noun, Adj, and Adv sets.
  R3: if (w1,g ∈ Person) ∧ (Person = {‘คนไข้, ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, ……}), then the w1,g concept is “person”.
  R4: if w2,e ∈ Symptom-Expression-Level, then w2,e has a null value for a general concept (where Symptom-Expression-Level = {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}).
  R5: if (vweak,b = ‘รู้สึก/feel’) ∧ (winf,c = vstrong,f) ∧ (vstrong,f ∈ Verbstrong), then (vweak,b + winf,c) = vstrong,f.
  For example, the concept of (รู้สึก/feel)/Verbweak + (ปวด/pain)/Verbstrong is equivalent to “pain”, as shown in the following EDU with the wcSymt,ij feature pain(person, stomach).
  EDU: “คนไข้รู้สึกปวดกระเพาะอาหาร/The patient feels pain in the stomach”.
  ((คนไข้/Khnk̄hị̂)/patient)/NP1 (((รู้สึก/rū̂s̄ụk)/feel)/Verbweak ((ปวด/pwd)/pain)/Verbstrong ((กระเพาะอาหาร/krapheāa xāh̄ār)/stomach)/Noun)/VP.
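Rules R2 and R5 can be sketched in Python on a verb phrase whose terms have already been translated into English concepts. The sets are abbreviated, hypothetical excerpts of Symptom-Cue and Verbstrong, and the function name `symptom_expression` is an assumption made for illustration.

```python
# Hypothetical English-only excerpts of the sets named in R2 and R5.
SYMPTOM_CUES = {("have", "symptom"), ("be", "symptom"), ("occur", "symptom")}
STRONG_VERBS = {"pain", "beTired", "shake", "pulse"}

def symptom_expression(vp):
    """Return (va, w2) for a verb phrase given as a list of concepts."""
    vp = list(vp)
    # R5: "feel" followed by a strong verb is equivalent to the strong verb.
    if len(vp) >= 2 and vp[0] == "feel" and vp[1] in STRONG_VERBS:
        vp = vp[1:]
    # R2: skip a symptom cue and take the next two words as va + w2.
    if tuple(vp[:2]) in SYMPTOM_CUES:
        vp = vp[2:]
    va = vp[0] if vp else None
    w2 = vp[1] if len(vp) > 1 else None
    return va, w2

print(symptom_expression(["feel", "pain", "stomach"]))            # ('pain', 'stomach')
print(symptom_expression(["have", "symptom", "pain", "muscle"]))  # ('pain', 'muscle')
```

Both calls reduce a longer verb phrase to the actual symptom expression, so "feels pain in the stomach" and "has the symptom of muscle pain" are handled by the same pattern.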
With regard to complications, if an element of the complicationTerm set is identified, all wcSymt,ij features occurring right after that element are excluded.
Therefore, all wcSymt,ij features of the wc expressions from each dt,i document of the learning corpus are determined and grouped into the corresponding symptom-concept group of the dt,i feature (as SymGroupdt,i) for the automatic supervised-learning step of dt,i wcSymt,ij Pairs having CErel in Section 4.3.

4.3. Automatic-Supervised-Learning of dt,i wcSymt,ij Pairs Having CErel

Each dt,i feature and the SymGroupdt,i elements, i.e., the wcSymt,ij features determined in the previous step, are used in this automatic supervised learning step, where Typet (t = 1) is the infectious disease type and Typet (t = 2) is the non-infectious disease type. The learning corpus of each Typet, containing the instances of dt,i wcSymt,ij pairs resulting from the Cartesian product of DS × SG, is used for learning the dt,i wcSymt,ij pairs having CErel by NB, SVM, and LR. A positive instance (CErel class) of a dt,i wcSymt,ij pair is formed if Typet of Dt and Typet of SymGrpt are the same; otherwise, a negative instance (nonCErel class) is formed according to Typet of Dt (see Section 3.2).
After the Cartesian product of DS × SG has been applied to the learning corpus with the 70 di documents, the positive and negative instances of each Typet are selected from its results by random sampling, with an approximately equal number of positive and negative instances covering all 70 di features. The sizes of the Type1 learning sample and the Type2 learning sample are 1878 instances and 2125 instances, respectively, each containing both positive and negative instances.
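The balanced random sampling of positive and negative instances can be sketched as follows. The function name `balanced_sample`, the seed, and the toy instances are hypothetical. This simplified sketch only balances the two classes; it does not enforce the additional requirement that the sample covers all 70 di features.

```python
import random

def balanced_sample(instances, per_class, seed=0):
    """Draw the same number of CErel and nonCErel instances at random."""
    rng = random.Random(seed)
    positives = [x for x in instances if x[2] == "CErel"]
    negatives = [x for x in instances if x[2] == "nonCErel"]
    # Never request more instances than the smaller class can supply.
    k = min(per_class, len(positives), len(negatives))
    return rng.sample(positives, k) + rng.sample(negatives, k)

# Toy instances: (disease, symptom feature, class).
toy = [("d1", f"sym{i}", "CErel") for i in range(5)]
toy += [("d1", f"other{i}", "nonCErel") for i in range(3)]

sample = balanced_sample(toy, per_class=4)
print(len(sample))
```

With five positive and three negative toy instances, the sample contains three of each class.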
NB learning [24]: the feature sets, Dt and SymGrpt, exist in the dt,i wcSymt,ij pairs of the positive/negative instances with the CE-rel/nonCE-rel class, respectively, formed by the automatic supervised learning on each disease-type learning sample. The learning results of this step by using Weka [35] are the feature probabilities of dt,i and wcSymt,ij in Typet where each wcSymt,ij feature is represented by the symptom-concept code (see Table 2).
SVM learning [23] with the linear kernel: The linear function, f(x), of the input x = (x1…xn) assigned to the positive class if f(x) ≥ 0, and otherwise to the negative class if f(x) < 0, can be written as follows:
$$f(x) = \langle w \cdot x \rangle + b = \sum_{k=1}^{n} w_k x_k + b \qquad (11)$$
where
  • $x$ is a dichotomous vector;
    $w$ is a weight vector;
    $b$ is a bias;
    and $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters that control the function.
With regard to each disease-type learning sample, the SVM learning determines wk and b for dt,i and wcSymt,ij features (xk) in each disease-symptom pair (dt,i wcSymt,ij) with either the positive class (CE-rel) or the negative class (nonCE-rel) formed by the automatic supervised learning on each disease-type learning sample.
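The linear decision function above can be written as a short Python sketch. The weights, bias, and the three-position dichotomous encoding are made-up illustration values, not parameters learned by Weka, and the function names are hypothetical.

```python
def svm_decision(w, x, b):
    """Linear function f(x) = sum_k w_k * x_k + b."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def svm_class(w, x, b):
    """Positive class (CE-rel) if f(x) >= 0, otherwise negative (nonCE-rel)."""
    return "CE-rel" if svm_decision(w, x, b) >= 0 else "nonCE-rel"

# Hypothetical weights over a dichotomous vector of three features.
w = [0.5, -1.0, 0.75]
b = -1.0

print(svm_class(w, [1, 0, 1], b))  # f(x) = 0.25
print(svm_class(w, [0, 1, 0], b))  # f(x) = -2.0
```

The first vector falls on the positive side of the hyperplane and the second on the negative side.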
LR [25]: The logistic regression model of the research is based on linear logistic regression with binary vector data. In general, input data with any value can be used to establish which attributes are influential in predicting the given outcome; the output takes values between 0 and 1 and hence can be interpreted as a probability. The logistic function can be written as:
$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \qquad (12)$$
F(x) is interpreted as the probability of the given outcome to be predicted, where x1 and x2 are attribute variables, and β0, β1, and β2 are the model estimators which play the role of momentum for each attribute. The LR learning is to determine β0, β1, and β2 for dt,i and wcSymt,ij as the x1 and x2 features, respectively, in each disease-symptom pair (dt,i, wcSymt,ij) with either the positive/CErel class or the negative/nonCErel class formed by the automatic supervised learning on each disease-type learning sample.
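The logistic function with two attribute variables can be sketched in Python as below. The function name `logistic` and the estimator values are hypothetical and chosen only to show the shape of the function; the sketch assumes the standard form with a negative exponent.

```python
import math

def logistic(x1, x2, beta0, beta1, beta2):
    """F(x) = 1 / (1 + exp(-(beta0 + beta1*x1 + beta2*x2)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x1 + beta2 * x2)))

# With all estimators contributing zero, the probability is exactly 0.5.
print(logistic(0, 0, beta0=0.0, beta1=1.0, beta2=1.0))
# A larger linear term pushes the probability towards 1.
print(logistic(1, 1, beta0=0.0, beta1=1.0, beta2=1.0))
```

The output always lies strictly between 0 and 1, which is what allows it to be read as a class probability.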

4.4. Determination of dt,i wcSymt,ij Pairs Having CErel for Collecting dt,i-SymGroupdt,i Pairs with CErel

There are three steps in the CErel determination from the test corpus consisting of 70 di documents: dt,i and wcSymt,ij Determination, Determination of dt,i wcSymt,ij Pairs having CErel, and Collection of dt,i-SymGroupdt,i Pairs having CErel.
i. dt,i and wcSymt,ij Determination
The disease name concept, dt,i, is determined from the dt,i document having di as the topic name in Typet by the named-entity recognition [30,31], followed by the named-entity translation from Thai to English by the Lexitron dictionary, in each disease type of the test corpus. After stemming words and eliminating stop words of the test corpus on each disease type, the EDUh,Symt,ij occurrence is identified by applying the predicate-verb term set (V; V = Verbstrong ∪ Vinf) to each EDUh occurrence in the dt,i document. According to R1–R5, each wcSymt,ij feature of the wc expression on the identified EDUh,Symt,ij occurrence is then determined by translating all terms in the predicate-argument pattern from Thai to English using the collected element concepts in the Verbstrong, Verbweak, Noun, Adj, and Adv sets. With regard to complications, if an element of the complicationTerm set is identified, all wcSymt,ij features occurring right after that element are excluded.
ii. Determination of dt,i wcSymt,ij Pairs having CErel
The objective of this step is to determine the dt,i wcSymt,ij pairs having the CErel class by NB, SVM, and LR for each disease type of the test corpus.
NB: The cause-effect relation between the dt,i features and the wcSymt,ij features in the test corpus of each disease type is solved by Equation (13) along with the probabilities of dt,i and wcSymt,ij in Table 2.
$$\mathrm{DSympPairClass} = \underset{class \in Class}{\arg\max}\; P(class \mid d_{t,i}, wcSym_{t,ij}) = \underset{class \in Class}{\arg\max}\; P(d_{t,i} \mid class)\, P(wcSym_{t,ij} \mid class)\, P(class) \qquad (13)$$
where
  • DSympPairClass is a relation class between a disease-name concept and a symptom concept on a disease-symptom pair;
    $d_{t,i} \in D_t$, on which $D_t$ is a disease name set in Type$_t$;
    $t = 1, 2$;
    $i = 1, 2, \ldots, numofDiseaseElements$;
    $wcSym_{t,ij}$ is a wc concept, particularly a symptom concept of a wc expression on the $d_{t,i}$ document;
    $Class = \{\text{CE-rel}, \text{nonCE-rel}\}$.
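The NB decision above selects the class that maximizes the product of the class prior and the two conditional feature probabilities. A minimal Python sketch follows; all probabilities are made-up illustration values rather than the values of Table 2, and the function name `nb_class` is hypothetical.

```python
def nb_class(d, sym, prior, p_d, p_sym):
    """argmax over classes of P(d|class) * P(sym|class) * P(class)."""
    scores = {
        c: p_d[c].get(d, 0.0) * p_sym[c].get(sym, 0.0) * prior[c]
        for c in prior
    }
    return max(scores, key=scores.get)

# Hypothetical feature probabilities for one disease type.
prior = {"CE-rel": 0.5, "nonCE-rel": 0.5}
p_d = {
    "CE-rel": {"influenza": 0.10},
    "nonCE-rel": {"influenza": 0.10},
}
p_sym = {
    "CE-rel": {"cough(person)": 0.20, "beThirsty(person)": 0.01},
    "nonCE-rel": {"cough(person)": 0.02, "beThirsty(person)": 0.15},
}

print(nb_class("influenza", "cough(person)", prior, p_d, p_sym))
print(nb_class("influenza", "beThirsty(person)", prior, p_d, p_sym))
```

In this toy setting the first pair is assigned to CE-rel and the second to nonCE-rel, because the symptom probability dominates when the disease probabilities are equal in both classes.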
If DSympPairClass of Equation (13) is the CE-rel/CErel class, the dt,i wcSymt,ij pair is collected into DSPt (which is a list of disease-symptom pairs having CErel in Typet), as shown in the algorithm for determining disease-symptom pairs having CErel (Algorithm 1).
Algorithm 1 DeterminationOfDiseaseSymptomPairsHavingCErel Algorithm.
Assume that each EDU is represented by (NP VP) after word stemming and stop-word removal;
Ldti is a list of EDUs in the dt,i document;
dt,i is a disease name of Typet (t = {1, 2}); i = 1, 2, …, numofDiseaseElementst;
DSPt is a list of disease-symptom pairs with CErel in Typet;
Dnamet is a disease name set in Typet;
DETERMINATION_OF_DISEASE_SYMPTOM_PAIRS_HAVING_CEREL
Dname1 = ∅; Dname2 = ∅;
ArrayList<String>[] DSP = new ArrayList[2];
String[][] d = new String[2][40];
/* DSPt (t = {1, 2}) contains two ArrayList data structures, for Type1 and Type2 */
Set<String> complicationTermSet = new HashSet<String>();
/* complicationTermSet is a set of complication terms */
complicationTermSet.add(“ภาวะแทรกซ้อน/complications”);
complicationTermSet.add(“ไม่รักษา/unTreat”);
/* add more elements into complicationTermSet */
Count = complicationTermSet.size();
/* Count is the number of complicationTermSet elements */
String[] CTS = complicationTermSet.toArray(new String[complicationTermSet.size()]);
/* convert complicationTermSet from a set structure to an array structure */
For (t = 1 to 2; t++)
{ If t = 1 then n = 33 else n = 37;
  For (i = 1 to n; i++)
  { dt,i = getDiseaseNameConceptFromDocumentTopicName; /* get a disease name */
    Dnamet = Dnamet ∪ {dt,i}; h = 1; j = 0; complication = 0;
    /* each disease name element is collected into Dnamet */
    while h ≤ length[Ldti] ∧ complication = 0 do
    { For (k = 1 to Count; k++) /* check complications */
        If EDUh.contains(CTSk) then complication = 1;
      vh = EDUh.VP.verb; wrdh = EDUh.VP.word;
      /* verb is the main verb of EDUh (a verb of EDUh.VP) */
      /* word is the word right after the main verb of EDUh.VP */
      If (complication = 0) ∧ ((vh ∈ Verbstrong) ∨ (vh + wrdh ∈ Vinf)) then
      /* V is the predicate-verb term set; V = Verbstrong ∪ Vinf */
      { If EDUh.headNounOfNP1 ∈ W1 ∧ EDUh.firstWordOfNP2 ∈ W2 then
        /* W1 is an agent-term set; W2 is a linguistic-patient term set */
        { j++;
          sym = wcSymt,ijDetermination;
          /* based on the wc expression of EDUh,Symt,ij by using R1–R5 */
          switch (choice) {
            Case 1: CErelDetermination(dt,i, sym) by Equation (13); break; /* NB */
            Case 2: CErelDetermination(dt,i, sym) by Equation (11); break; /* SVM */
            Case 3: CErelDetermination(dt,i, sym) by Equation (12); break; /* LR */
          }
          If (class = ‘Positive’) ∨ (class = ‘CE-rel’) then
            DSPt.AddCauseEffectPairWithCErel(dt,i + “-” + sym);
        } }
      h++ } } }
Return Dnamet, DSPt
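The control flow of Algorithm 1 can be approximated by the following runnable Python sketch. It is a simplification under stated assumptions: each EDU is a dictionary whose `wcSym` entry already holds the feature produced by R1–R5 (or `None` when the EDU carries no symptom expression), the classifier is passed in as a hypothetical callable `classify`, and pairs are stored as tuples rather than as hyphen-joined strings.

```python
def determine_pairs(documents, complication_terms, classify):
    """documents: {type t: {disease name: [EDU dicts]}}.
    classify(d, sym) returns 'CE-rel' or 'nonCE-rel' (NB, SVM, or LR)."""
    dname = {t: set() for t in documents}
    dsp = {t: [] for t in documents}
    for t, docs in documents.items():
        for d, edus in docs.items():
            dname[t].add(d)
            for edu in edus:
                # Stop reading the document at the first complication term.
                if any(term in edu["words"] for term in complication_terms):
                    break
                sym = edu.get("wcSym")
                if sym is not None and classify(d, sym) == "CE-rel":
                    dsp[t].append((d, sym))
    return dname, dsp

# Toy test document (hypothetical content).
documents = {
    1: {"influenza": [
        {"words": ["person", "cough"], "wcSym": "cough(person)"},
        {"words": ["complications"], "wcSym": None},
        {"words": ["person", "pain", "chest"], "wcSym": "pain(person,chest)"},
    ]},
}
dname, dsp = determine_pairs(documents, {"complications", "unTreat"},
                             classify=lambda d, sym: "CE-rel")
print(dname, dsp)
```

The chest-pain EDU is not collected because it occurs after the complication term, which mirrors the exclusion of complication symptoms.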
SVM: The cause-effect relation between dt,i and wcSymt,ij of the dt,i wcSymt,ij pair from the test corpus of each disease type is solved by the weight vector from all dt,i and wcSymt,ij features. The weight vector and the bias obtained from the SVM learning by using Weka [35] are used to determine the dt,i wcSymt,ij pair with CE-rel by Equation (11). If f(x) ≥ 0, the dt,i wcsymt,ij pair with CE-rel/CErel as the positive class occurs, otherwise the nonCE-rel/nonCErel occurs as the negative class. The dt,i wcSymt,ij pair with the positive class is collected into DSPt as shown in the algorithm of determining disease-symptom pairs having CErel (Algorithm 1).
LR: The research applies Equation (12) to determine the DSympPair class which is a relation class, i.e., a CE-rel/nonCE-rel class, between a disease-name concept (dt,i) and a symptom concept (wcSymt,ij) on a dt,i wcSymt,ij pair from the test corpus of each disease type, whilst F(x) is interpreted as the probability of either “positive/CErel” as the CE-rel class or “negative/nonCErel” as the nonCE-rel class by the following rules.
Rule 1 (CE-rel_Class): If F(x)CE-rel_Class ≥ 0.5, then the dt,i wcSymt,ij pair has CErel between dt,i and wcSymt,ij.
Rule 2 (nonCE-rel_Class): If F(x)nonCE-rel_Class ≥ 0.5, then the dt,i wcSymt,ij pair has nonCErel between dt,i and wcSymt,ij.
DSympPair Class = Max(F(x)CE-rel_Class, F(x)nonCE-rel_Class) (14)
According to Equation (14), x1 and x2 are the attribute variable pair of each dt,i wcSymt,ij pair from the test corpus of each disease type, where β0, β1, and β2 of dt,i and wcSymt,ij are obtained by the automatic supervised learning by LR on the learning sample of each disease type. The dt,i wcSymt,ij pair with the CE-rel class is collected into DSPt, as shown in the algorithm for determining disease-symptom pairs having CErel (Algorithm 1).
iii. Collection of dt,i-SymGroupdt,i Pairs having CErel
All correctly determined dt,i wcSymt,ij pairs having CErel in DSPt from the previous step are grouped by the same dt,i into SymGroupdt,i, resulting in the dt,i-SymGroupdt,i pair with CErel, as shown in Figure 6.
Dnamet from the Algorithm 1 results is Dt in Equation (3); Dt = {dt,1, dt,2, …, dt,numt}, where numt is numofDiseaseElementst in Typet.
According to Figure 6, all SymGroupdt,i of the dt,i-SymGroupdt,i pairs having CErel are collected into SymGrpt.
SymGrpt = {SymGroupdt,1, SymGroupdt,2, …, SymGroupdt,numt};
Dt and SymGrpt are used for the DSKG construction in the next step.
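The grouping of the collected pairs by disease name can be sketched in Python as follows. The function name `group_pairs` and the toy pairs are hypothetical, and the pairs are assumed to be available as (disease, symptom feature) tuples.

```python
def group_pairs(dsp):
    """Group (disease, symptom) pairs with CErel into one SymGroup per disease."""
    groups = {}
    for disease, sym in dsp:
        group = groups.setdefault(disease, [])
        if sym not in group:      # keep each symptom concept once
            group.append(sym)
    return groups

# Toy DSP_t list for one disease type.
dsp_t = [
    ("influenza", "haveFever(person)"),
    ("influenza", "cough(person)"),
    ("common cold", "cough(person)"),
    ("influenza", "cough(person)"),
]

sym_grp_t = group_pairs(dsp_t)   # plays the role of SymGrp_t
d_t = list(sym_grp_t)            # plays the role of D_t
print(d_t)
print(sym_grp_t)
```

Each key of the resulting dictionary is a disease name and each value is its symptom-concept group, which together correspond to one dt,i-SymGroupdt,i pair.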

4.5. DSKG Construction

Given Dt and SymGrpt, the elements of SymGrpt are united into Stypet by Equation (15). In addition, the Stypet sets of both disease types are united into S (the union of all SymGroupdt,i from both disease types) by Equation (16).
$$\mathrm{Stype}_t = \bigcup_{i=1}^{numofDname_t} \mathrm{SymGroup}_{d_{t,i}} \qquad (15)$$
where
  • Stype$_t$ is a symptom-concept set of Type$_t$ (t = 1, 2);
    $numofDname_t$ is the cardinality of the $D_t$ or Dname$_t$ set;
    SymGroup$_{d_{t,i}}$ is a symptom-concept group resulting from $d_{t,i}$.
$$S = \mathrm{Stype}_1 \cup \mathrm{Stype}_2 = \{s_1, s_2, \ldots, s_m\} \qquad (16)$$
where m is the cardinality of S.
S is then the symptom-concept feature vector of size m, where m is 143 in this research. PCA (based on IBM SPSS Statistics for Windows, Version 21.0) is used to reduce the m features of the symptom-concept feature vector by wrapping each sc feature element whose feature loading weight from the eigenvector has an absolute value of at least 0.4 (a threshold based on our corpus) within the corresponding Fgroupz to become Factorz (z = 1, 2, …, 39), as shown in Table 3.
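The wrapping of features into factors by the loading threshold can be illustrated with a short Python sketch. The loadings are made-up values rather than SPSS output, and the function name `build_factors` is hypothetical; the sketch only shows the thresholding step, not the PCA rotation itself.

```python
def build_factors(loadings, threshold=0.4):
    """loadings: {feature: [loading on Fgroup_1, ..., Fgroup_Z]}.
    Returns {z: {feature: loading}} for loadings with |w| >= threshold."""
    n_groups = len(next(iter(loadings.values())))
    factors = {z: {} for z in range(1, n_groups + 1)}
    for feature, row in loadings.items():
        for z, w in enumerate(row, start=1):
            if abs(w) >= threshold:
                factors[z][feature] = w
    return factors

# Hypothetical loadings of four symptom-concept features on two groups.
loadings = {
    "pain(person,muscle)": [0.72, 0.10],
    "haveFever(person)": [0.65, -0.20],
    "beThirsty(person)": [0.05, -0.81],
    "cough(person)": [0.30, 0.25],
}
factors = build_factors(loadings)
print(factors)
```

In this toy case the first two features are wrapped into Factor 1, the third into Factor 2 through its negative loading, and the last feature falls below the threshold in both groups.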
In Table 3, the number of sc features is reduced from 143 to 39 groups of Fgroupz. From Equation (16), we conclude Di as in Equation (17).
$$D_i = \bigcup_{t=1}^{2} D_t \qquad (17)$$
Therefore, each di (di ∈ Di) is paired with Factorz instead of SymGroupdi (see Table 4), where Factorz consists of the correlated sc feature elements and FactorScorez is calculated for each disease name, di.
In Table 4, we then select the Factorz occurrences as the common and relevant factors having the highest FactorScorez of each di for constructing DSKG (see Figure 7).
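The selection of the highest-scoring factors per disease and the resulting graph edges can be sketched as follows. The factor scores are made-up values, the function name `dskg_edges` is hypothetical, and keeping exactly one factor per disease (`top_k=1`) is an assumption, since the number of factors kept per disease is not fixed in the text.

```python
def dskg_edges(factor_scores, top_k=1):
    """factor_scores: {disease: {factor name: FactorScore}}.
    Returns (disease, 'CErel', factor) edges for the top_k factors."""
    edges = []
    for disease, scores in factor_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        edges += [(disease, "CErel", factor) for factor in ranked[:top_k]]
    return edges

# Hypothetical FactorScore values for three diseases and two factors.
factor_scores = {
    "influenza": {"Factor1": 1.8, "Factor2": -0.3},
    "common cold": {"Factor1": 1.2, "Factor2": 0.4},
    "diabetes": {"Factor1": -0.9, "Factor2": 2.1},
}
edges = dskg_edges(factor_scores)
for edge in edges:
    print(edge)
```

Two of the toy diseases point to the same factor node, which illustrates how a factor acts as a group of common symptom concepts shared among several disease-name nodes.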

5. Evaluation and Discussion

The research has three evaluations: Automatic Determination of wcSymt,ij Features, Determination of dt,i wcSymt,ij Pairs Having CErel, and DSKG Construction.

5.1. Automatic Determination of wcSymt,ij Features

The test sample of 35 different di documents from both disease types of the test corpus is used for evaluating the automatic determination of the wcSymt,ij features as the SymGroupdt,i elements without concept annotation. The automatic determination of the wcSymt,ij features is evaluated in terms of precision, recall, and F1-score in Table 5, based on the answer sets provided by three experts with max-win voting. The number of different wcSymt,ij (symptom-concept) features determined from the test sample shows that the Type2 (non-infectious disease type) documents have more diverse symptom-concept features than the Type1 (infectious disease type) documents, resulting in lower frequencies for most symptom-concept features (as shown in Figure 8).
From Table 5, the averages of the precision, the recall, and the F1-score for the wcSymt,ij feature determination from the Type1 and Type2 documents are 0.890, 0.773, and 0.827, respectively. The reason for having the low average recall of the wcSymt,ij feature determination from both disease types is the symptom with the long-tailed explanation, e.g.,
EDU: “คนไข้มีต่อมน้ำเหลืองบริเวณด้านหลังลำคอโตผิดปกติ/Khnk̄hị̂ mī t̀xmn̂ả h̄elụ̄xng briweṇ d̂ān h̄lạng lả khx to p̄hid pkti” (The patient has lymph nodes in the back of the neck being abnormally large.).
“(((คนไข้/Khnk̄hị̂)/patient)/Noun)/NP1 (((มี/mī)/has)/Verbweak ((ต่อมน้ำเหลือง/xmn̂ảh̄elụ̄xng)/lymph-nodes)/Noun ((บริเวณด้านหลัง/briweṇ d̂ān h̄lạng)/back area)/Noun ((ลำคอ/lả khx)/neck)/Noun ((โต/to)/large)/Verbstrong ((ผิดปกติ/p̄hid pkti)/abnormally)/Adv)/VP”.
where the wc pattern cannot cover the “(โต/to)/large” term as the main symptom concept. However, the medical symptom recognition of [16] from patient text, based on multi-label classification with expert labeling, achieves a 0.76 F1-score and covers symptoms with long-tailed explanations.
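The evaluation measures used in this section can be written as a small Python sketch. The counts in the usage example are hypothetical; the last line is only a consistency check showing that the harmonic mean of the reported average precision (0.890) and recall (0.773) agrees with the reported average F1-score to three decimals.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 8 correct features, 2 spurious, 2 missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)
print(p, r, f1_score(p, r))

# Consistency check against the reported averages of Table 5.
print(round(f1_score(0.890, 0.773), 3))  # 0.827
```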

5.2. Determination of dt,i wcSymt,ij Pairs Having CErel

On the test corpus, the correctly determined dt,i wcSymt,ij pairs having CErel, obtained by the automatic supervised learning, are collected or grouped by the same dt,i into the dt,i-SymGroupdt,i pairs having CErel. Thus, the correctly determined dt,i wcSymt,ij pairs having CErel also result in the correct dt,i-SymGroupdt,i pairs having CErel for the DSKG construction.
The determination of dt,i wcSymt,ij pairs having CErel by NB, SVM, and LR from the dt,i documents (4700 EDUs of the d1,i documents and 5300 EDUs of the d2,i documents) with the automatic-supervised learning is evaluated in terms of precision, recall, and F1-score, as shown in Table 6, based on three experts with max-win voting. Since the data from our learning corpus are balanced, we also report the Receiver Operating Characteristic (ROC) curves to show each classifier’s performance by Weka, as shown in Figure 9. The use of ROC curves to evaluate the performance follows the study layout of [36].
According to Figure 8, the lower diversity and higher frequency of the wcSym1,ij features from the Type1 documents, compared with the wcSym2,ij features from the Type2 documents, result in Type1 having better precision, recall, and F1-score than Type2 in determining dt,i wcSymt,ij pairs having CErel by NB, SVM, and LR, as shown in Table 6. With regard to NB, SVM, and LR, the evaluation results of determining dt,i wcSymt,ij pairs having CErel depend on (1) the frequencies of the dt,i and wcSymt,ij features for NB, (2) the weight vector for SVM, and (3) the model estimators (β1 and β2) together with the frequencies of the dt,i and wcSymt,ij features for LR. The precision, recall, and F1-score of SVM are therefore higher than those of NB for each disease type, because most of the wcSymt,ij features of each disease type occur with low frequencies in the documents (see Figure 8). For Type1, the F1-score of LR is 0.009 higher than that of SVM, since some wcSym1,ij features in the Type1 documents occur with high frequency. For Type2, the F1-score of LR is 0.001 lower than that of SVM, because most of the wcSym2,ij features in Type2 are low-frequency features. The LR and SVM performances evaluated by the F1-score do not differ much, whereas the LR and SVM performances evaluated by ROC Area, or AUC (Area under Curve), differ considerably in both disease types (see Figure 9). According to [36], LR with ROC Area = 0.850 and 0.843 in Type1 and Type2, respectively, is considered to have excellent performance in classifying dt,i wcSymt,ij pairs having CErel in both disease types.
Therefore, we use the LR estimator to determine dt,i wcSymt,ij pairs having CErel from the external data of another corpus, e.g., Patient Care and Health Information from Mayo Clinic (https://www.mayoclinic.org/patient-care-and-health-information (accessed on 20 May 2022)), after using WordNet and MeSH to determine the symptom concepts from noun phrases. The evaluation results are as follows: Type1 has a precision of 0.750, a recall of 0.667, and an F1-score of 0.706; Type2 has a precision of 0.857, a recall of 0.600, and an F1-score of 0.706. In contrast, our LR F1-scores of 0.761 and 0.743 for Type1 and Type2, respectively, are higher than those from the external corpus because our symptom concepts are based on the composite variable.
Overall, the average F1-scores in determining dt,i wcSymt,ij pairs having CErel for Type1 and Type2 by NB, SVM, and LR are 0.743, 0.748, and 0.752, respectively, based on automatic supervised learning on the disease feature set and the symptom feature set (the composite-feature set). By comparison, the previous work [14] determines the disease-symptom relation by neural network learning from each sentence containing a disease-name and symptom-term pair, based on human annotation, with an F1-score of 0.883, even though its symptom terms are based on nouns/noun phrases. In addition, another previous work [13] automatically determines the disease-symptom relation by NB, LR, and a Bayesian network using noisy OR gates, with an F1-score of 0.704, from patient textual data (where the symptom occurrences are based on the noun/noun-phrase expression feature set). The work [13], which is based on specific cases, is useful for physicians in the curative system, whereas our research, which is based on two disease-symptom corpora randomly selected from two medical web-board resources, i.e., the Thai hospital and Thai-Healthcare-Knowledge web-boards, is useful for non-professionals in preliminary diagnosis.
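The averages quoted above can be recomputed directly from the F1-score rows of Table 6. The short sketch below does so and also derives the LR versus SVM differences discussed for each disease type; the variable names are illustrative only.

```python
# F1-scores copied from Table 6 as (Type1, Type2) for each classifier.
f1_scores = {
    "NB": (0.745, 0.740),
    "SVM": (0.752, 0.744),
    "LR": (0.761, 0.743),
}

# Unweighted mean over the two disease types.
average_f1 = {name: sum(pair) / len(pair) for name, pair in f1_scores.items()}

# Difference between LR and SVM for each disease type.
lr_minus_svm = {
    "Type1": f1_scores["LR"][0] - f1_scores["SVM"][0],
    "Type2": f1_scores["LR"][1] - f1_scores["SVM"][1],
}

for name, value in average_f1.items():
    print(f"{name}: {value:.4f}")
```

The NB mean is exactly 0.7425, which matches the reported 0.743 when rounded half up to three decimals.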
Moreover, the performance of the CErel classification model based on machine learning is generally evaluated by the F1-score and the ROC curve, for which a statistical test following [37] gives the following result: the area under the ROC curve (AUC) of LR = 0.843 (Type2) yields a test statistic of 10.6 with a p-value < 0.001, indicating excellent discriminating ability.
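The text does not spell out how the test statistic is computed. One common choice, assumed here purely for illustration, is a z-test of the AUC against the chance level of 0.5 using the Hanley and McNeil standard error. The class sizes `n_pos` and `n_neg` below are hypothetical because the section does not report them, so the resulting value is not expected to reproduce the reported 10.6.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley and McNeil approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

def auc_z_statistic(auc, n_pos, n_neg):
    """z statistic for the null hypothesis AUC = 0.5 (no discrimination)."""
    return (auc - 0.5) / hanley_mcneil_se(auc, n_pos, n_neg)

def two_sided_p_value(z):
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# AUC of LR for Type2 from the text; the class sizes are assumed, not reported.
z = auc_z_statistic(0.843, n_pos=100, n_neg=100)
p = two_sided_p_value(z)
print(f"z={z:.2f}, p<0.001: {p < 0.001}")
```

Larger samples shrink the standard error, so the same AUC produces a larger z statistic as the number of evaluated pairs grows.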

5.3. DSKG Construction

We evaluate the DSKG construction with a 10-point score for concise representation and a 10-point score for precise representation, both based on assessments by three experts and two end-users who are non-professionals. The average assessments of the concise representation and the precise representation by the experts and end-users are 9 and 7.8, respectively. The precise-representation score of 7.8 reflects that the cause-effect network should also include some information on the patient's condition and the severity of that condition. Nevertheless, the DSKG construction by PCA enhances primary health care by supporting non-professionals in preliminary diagnosis with a concise and precise representation through an online recommender application system, whereas the graphs/networks of previous works on cause-effect graph/network construction from texts, e.g., [13,15], are mostly less concise because they contain all symptom terms of each disease term.

6. Conclusions

This research constructs a DSKG consisting of several CErel connections determined from the disease documents downloaded from the Thai hospital and Thai-Healthcare-Knowledge web-boards, where each CErel connection links the causative-concept node (the <di> node, represented by the di feature) to the correlated effect-concept node (the <sn1, sn2, …, snη> node, represented by a group of correlated sc features as the common symptom-concept features). Moreover, all sc features of each di document in our research are based on the basic symptoms. Our proposed method of constructing the DSKG, which needs to reduce the high-dimensional feature space of S (S = {sc}) for the graph construction, relies on the determination of the wcSymt,ij features representing the sc features and on the automatic supervised learning of dt,i wcSymt,ij pairs having CErel from the downloaded documents as unstructured data. PCA is then proposed for constructing the DSKG through dimensionality reduction of the symptom-concept feature space with minimal information loss. In the evaluation of the proposed method, the conciseness and precision of the DSKG construction depend on the number of different symptom-concept features and the number of different disease-name concept features, whilst the accuracy of the CErel determination depends on the corpus behavior, e.g., the diversity and the frequencies of the wcSymt,ij feature occurrences. In addition, the accuracy of the wcSymt,ij feature determination depends on the number of symptoms with long-tailed explanations.
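To make the dimensionality-reduction step more concrete, the following sketch extracts the first principal component of a small binary disease-by-symptom matrix. It is only a stand-in under stated assumptions: the matrix is a hypothetical toy example, and a plain power iteration on the covariance matrix replaces whatever PCA routine was used for the actual DSKG.

```python
import math

# Hypothetical toy matrix: rows are disease documents, columns are symptom
# concepts s1..s4, and 1 means the symptom was determined for that disease.
symptom_matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]

def covariance(rows):
    """Sample covariance matrix of the columns of rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    return [
        [sum(z[a] * z[b] for z in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

def mat_vec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def first_component(cov, iterations=200):
    """Largest eigenvalue and unit eigenvector of cov via power iteration."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

cov = covariance(symptom_matrix)
eigenvalue, component = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance
print([round(x, 3) for x in component], round(explained_ratio, 3))
```

In this toy case the first component contrasts s1 and s2 against s3 and s4, so symptoms that tend to co-occur load with the same sign, which is the property the grouping of correlated sc features relies on.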
In contrast to previous works on cause-effect/disease-symptom relation determination and on cause-effect/disease-symptom knowledge graph/network construction from documents, our approach differs in three respects. (1) Our determination of the symptom-concept feature, wcSymt,ij, as the composite variable (which relies on the predicate-argument pattern from NP1 and VP through the preparation and collection of the Verbstrong, Verbweak, Noun, Adj, and Adv sets from the medical-symptom-expression list on Wikipedia including MeSH, without concept annotation) has a high F1-score, whereas the symptom-concept determination of the previous works is based only on either noun-phrase [12,13,14,15] or verb-phrase [16] concepts, obtained by either expert annotation or automatic string matching to ICD-9 codes and UMLS concepts. (2) Our determination of dt,i wcSymt,ij pairs having CErel is based on the Cartesian product of DS × SG as automatic supervised learning, in which each di feature from a noun/noun-phrase expression is paired with each group of the wcSymt,ij features from the wc expressions of several EDUs with/without complications, whereas the previous works determine the cause-effect/disease-symptom relation on noun/noun-phrase pairs with either supervised learning by experts [11] or automatic determination relying on ICD-9 codes [13], without considering that some dt,i documents contain both the basic symptoms and the complications. (3) Our DSKG construction by PCA, which reduces the high-dimensional symptom-concept features, presents the DSKG with preciseness and high conciseness, whereas the cause-effect/disease-symptom graphs/networks constructed from texts in previous works consist of CErel connections between a causative-concept node (represented by one causative-concept feature) and the corresponding effect-concept node (represented by one effect-concept feature), without addressing the high-dimensional effect features, particularly the symptom features [7,13,15].
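The idea of labelling training instances through the Cartesian product can be sketched as follows. This is a simplified reading under stated assumptions: DS is taken as the set of disease-name concepts, SG as the pooled symptom concepts, and a pair counts as a positive (CErel) instance only when the symptom was determined from that disease's own document. The toy identifiers are hypothetical.

```python
from itertools import product

# Hypothetical toy corpus: symptom concepts determined from each disease document.
disease_symptoms = {
    "d1": {"s1", "s2"},
    "d2": {"s1", "s3"},
}

DS = sorted(disease_symptoms)                          # disease-name concepts
SG = sorted(set().union(*disease_symptoms.values()))   # pooled symptom concepts

def label_pairs(diseases, symptoms, observed):
    """Label every (disease, symptom) pair of the Cartesian product."""
    return [
        (d, s, "CErel" if s in observed[d] else "nonCErel")
        for d, s in product(diseases, symptoms)
    ]

instances = label_pairs(DS, SG, disease_symptoms)
for instance in instances:
    print(instance)
```

Because the labels fall out of document membership, no manual annotation is needed to obtain both positive and negative samples for NB, SVM, or LR training.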
Moreover, the DSKG results, e.g., in Figure 7, were frequently found to be in alignment with scientific findings and with the objective of this research. For example, following the DSKG in Figure 7, vascular diseases were found to be associated with several diseases, including kidney disease and myocardial infarction, which is also reported in the literature (e.g., [38,39,40]). In the future, the temporal feature and the condition feature should be considered to increase the accuracy of the CErel determination, which in turn would increase the preciseness of the DSKG for web-application development as in [41]. Moreover, the proposed method can also be applied to other languages, and the DSKG of our research enhances primary health care by supporting non-professionals with a knowledge structure for preliminary diagnosis through the recommender system.

Author Contributions

Conceptualization, C.P. and R.P.; Data curation, C.P. and R.P.; Formal analysis, C.P. and R.P.; Funding acquisition, C.P.; Investigation, C.P. and R.P.; Methodology, C.P. and R.P.; Project administration, C.P.; Resources, C.P. and R.P.; Software, C.P.; Supervision, C.P.; Validation, C.P. and R.P.; Visualization, C.P.; Writing—original draft, C.P.; Writing—review & editing, R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kroenke, K. A practical and evidence-based approach to common symptoms: A narrative review. Ann. Intern. Med. 2014, 161, 579–586.
  2. Khoo, C.; Na, J.C. Semantic Relations in Information Science. Annu. Rev. Inf. Sci. Technol. 2006, 40, 157–228.
  3. Carlson, L.; Marcu, D.; Okurowski, M.E. Building a discourse-tagged corpus in the framework of rhetorical structure theory. Curr. New Dir. Discourse Dialogue 2003, 22, 85–112.
  4. Pustejovsky, J. The syntax of event structure. Cognition 1991, 41, 47–81.
  5. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
  6. Girju, R. Automatic detection of causal relations for question answering. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Workshop on Multilingual Summarization and Question Answering-Machine Learning and beyond, Sapporo, Japan, 7–12 July 2003; pp. 76–83. Available online: https://aclanthology.org/W03-1210.pdf (accessed on 25 June 2007).
  7. Chang, D.-S.; Choi, K.-S. Incremental cue phrase learning and bootstrapping method for causality extraction using cue phrase and word pair probabilities. Inf. Process. Manag. 2006, 42, 662–678.
  8. Riaz, M.; Girju, R. Recognizing Causality in Verb-Noun Pairs via Noun and Verb Semantics. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language, Gothenburg, Sweden, 6 April 2014; pp. 48–57. Available online: https://aclanthology.org/W14-0707.pdf (accessed on 25 May 2022).
  9. Zhao, S.; Liu, T.; Zhao, S.; Chen, Y.; Nie, J.-Y. Event causality extraction based on connectives analysis. Neurocomputing 2016, 173, 1943–1950.
  10. Sorgente, A.; Vettigli, G.; Mele, F. A Hybrid Approach for the Automatic Extraction of Causal Relations from Text. Emerg. Ideas Inf. Filter. Retr. 2017, 746, 15–29.
  11. Cao, M.; Sun, X.; Zhuge, H. The contribution of cause-effect link to representing the core of scientific paper: The role of Semantic Link Network. PLoS ONE 2018, 13, e0199303.
  12. Hassan, M.; Makkaoui, O.; Coulet, A.; Toussaint, Y. Extracting Disease-Symptom Relationships by Learning Syntactic Patterns from Dependency Graphs. In Proceedings of BioNLP 15; Association for Computational Linguistics: Beijing, China, 2015; pp. 71–80.
  13. Rotmensch, M.; Halpern, Y.; Tlimat, A.; Horng, S.; Sontag, D. Learning a Health Knowledge Graph from Electronic Medical Records. Sci. Rep. 2017, 7, 5994.
  14. Wada, S.; Iida, R.; Torisawa, K.; Takeda, T.; Manabe, S.; Matsumura, Y. Extracting Symptom Names and Disease-Symptom Relationships from Web Texts Using a Multi-Column Convolutional Neural Network. MedInfo 2019, 264, 423–427.
  15. Zhou, X.; Menche, J.; Barabasi, A.; Sharma, A. Human symptoms–disease network. Nat. Commun. 2014, 5, 4212.
  16. Mottaghi, A.; Sarma, P.K.; Amatriain, X.; Yeung, S.; Kannan, A. Medical symptom recognition from patient text: An active learning approach for long-tailed multilabel distributions. arXiv 2021, arXiv:2011.06874.
  17. Song, M.-K.; Lin, F.-C.; Ward, S.E.; Fine, J.P. Composite Variables. Nurs. Res. 2013, 62, 45–49.
  18. Zawbaa, H.M.; Emary, E.; Grosan, C.; Snasel, V. Large-dimensionality small-instance set feature selection: A hybrid bio-inspired heuristic approach. Swarm Evol. Comput. 2018, 42, 29–42.
  19. Figueiredo, F.; Rocha, L.; Couto, T.; Salles, T.; Gonçalves, M.A.; Meira, W., Jr. Word co-occurrence features for text classification. Inf. Syst. 2011, 36, 843–858.
  20. Sedighi, M. Application of word co-occurrence analysis method in mapping of the scientific fields (case study: The field of Informetrics). Libr. Rev. 2016, 65, 52–64.
  21. Nykamp, D.Q. Cartesian Product Definition, from Math Insight. Available online: http://mathinsight.org/definition/cartesian_product (accessed on 25 May 2022).
  22. Falke, T.; Gurevych, I. Utilizing Automatic Predicate-Argument Analysis for Concept Map Mining. In Proceedings of the 12th International Conference on Computational Semantics (IWCS), Montpellier, France, 19–22 September 2017; pp. 1–10. Available online: https://aclanthology.org/W17-6909 (accessed on 25 May 2022).
  23. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines; Cambridge University Press: Cambridge, UK, 2000.
  24. Mitchell, T.M. Machine Learning; The McGraw-Hill Co., Inc.: New York, NY, USA; MIT Press: Singapore, 1997; Available online: https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf (accessed on 25 September 2000).
  25. Freedman, D.A. Statistical Models: Theory and Practice; Cambridge University Press: Cambridge, UK, 2009; Available online: https://scholar.google.co.th/citations?user=HntkbYMAAAAJ&hl=en&oi=sra (accessed on 25 June 2015).
  26. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
  27. Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis, A Wiley-Interscience Publication; John Wiley & Sons, Inc.: New York, NY, USA, 2001; Available online: https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf (accessed on 25 April 2021).
  28. Reutrakul, S.; Deerochanawong, C. Diabetes in Thailand: Status and Policy. Curr. Diabetes Rep. 2016, 16, 28.
  29. Sudprasert, S.; Kawtrakul, A. Thai word segmentation based on global and local unsupervised learning. In Proceedings of the NCSEC 2003, Chonburi, Thailand, 28–30 October 2003; pp. 1–8.
  30. Chanlekha, H.; Kawtrakul, A. Thai named entity extraction by incorporating maximum entropy model with simple heuristic information. In Proceedings of the IJCNLP 2004, Hainan, China, 22–24 March 2004; pp. 1–7. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.1449 (accessed on 25 December 2007).
  31. Tongtep, N.; Theeramunkong, T. Pattern-based Extraction of Named Entities in Thai News Documents. Thammasat Int. J. Sci. Technol. 2010, 15, 70–81.
  32. Chareonsuk, J.; Sukvakree, T.; Kawtrakul, A. Elementary discourse unit segmentation for Thai using discourse cue and syntactic information. In Proceedings of the NCSEC 2005, Bangkok, Thailand, 27–28 October 2005; pp. 85–90.
  33. Ketui, N.; Theeramunkong, T.; Onsuwan, C. Thai elementary discourse unit analysis and syntactic-based segmentation. Information 2013, 16, 7423–7436.
  34. Smith, J.G.; Duncan, A.J. Elementary Statistics and Applications: Fundamentals of the Theory of Statistics; McGraw-Hill Book Company Inc.: London, UK, 1944.
  35. Eibe, F.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann 2016. Available online: https://www.cs.waikato.ac.nz/ml/weka/ (accessed on 25 May 2022).
  36. Le, N.Q.K.; Kha, Q.H.; Nguyen, V.H.; Chen, Y.-C.; Cheng, S.-J.; Chen, C.-Y. Machine Learning-Based Radiomics Signatures for EGFR and KRAS Mutations Prediction in Non-Small-Cell Lung Cancer. Int. J. Mol. Sci. 2021, 22, 9254.
  37. Mandrekar, J.N. Receiver Operating Characteristic Curve in Diagnostic Test Assessment. J. Thorac. Oncol. 2010, 5, 1315–1316.
  38. Nakamura, R.; Tomiyoshi, G.; Shinmen, N.; Kuroda, H.; Kudo, T.; Doi, H.; Mine, S.; Machida, T.; Kamitsukasa, I.; Wada, T.; et al. An anti-deoxyhypusine synthase antibody as a marker of atherosclerosis-related cerebral infarction, myocardial infarction, diabetes mellitus, and chronic kidney disease. SM Atheroscler. J. 2017, 1, 1001.
  39. Daniels, S.R.; Bates, S.; Lukin, R.R.; Benton, C.; Third, J.; Glueck, C.J. Cerebrovascular arteriopathy (arteriosclerosis) and ischemic childhood stroke. Stroke 1982, 13, 360–365.
  40. González-Ramírez, J.; Martínez-Coronilla, G.; López-Rocha, L.D.; Leija-Montoya, A.G.; Hernández-Díazcouder, A.; Fontes-Garcia, Z.; Silva-Mancilla, M.; Sánchez-Muñoz, F. Acute Myocardial Infarction and Periodontitis: Importance of Awareness and Prevention in Latin America. Appl. Sci. 2022, 12, 3131.
  41. Mazorchuck, M.; Dobriak, V.; Chumachenko, D. Web-Application Development for Tasks of Prediction in Medical Domain. In Proceedings of the 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine, 11–14 September 2018; Volume 1, pp. 5–8.
Figure 1. The present DSKG, formed by several CErel connections, where each CErel connection links a <di> node to a <s1, s3, ..., sβ> node as an occurrence of multiple symptoms, and s1, s3, ..., and sβ are the correlated sc features; d1 → s1 ∧ s2 ∧ s8 … ∧ sβ; d2 → s1 ∧ s3 ∧ s4 ∧ s8 … ∧ sβ; …; dnum → s1 ∧ s5 ∧ s8 … ∧ sβ.
Figure 2. A general Thai linguistic expression including Thai-to-English translation by Lexitron Dictionary, where NP1 and NP2 are noun phrases; VP is a verb phrase; Verbstrong is a strong-verb concept set; Verbweak is a weak-verb concept set; Adv is an adverb concept set; Adj is an adjective concept set; and Noun is a noun concept set. English translations of the non-English text are provided in the figure.
Figure 3. The EDU expressions with symptom concepts in the downloaded disease document of Diabetes (where a [...] symbol means an ellipsis of a word/words inside the symbol). English translations of the non-English text are provided in the figure.
Figure 4. The positive and negative sample instances from the result of the Cartesian product of DS × SG.
Figure 5. A Framework of DSKG Construction.
Figure 6. The dt,i-SymGroupdt,i pairs having CErel after grouping the correctly determined dt,i wcSymt,ij pairs having CErel by dt,i.
Figure 7. The DSKG, in which each arrow is a CErel that connects di to Factorz (or Fz), e.g., Factor18/F18 and Factor20/F20, which contain some common sc features represented by the underlined sc.
Figure 8. The frequencies of the correctly determined wcSymt,ij features from 35 different di documents containing the 17 d1,i documents in Type1 and the 18 d2,i documents in Type2.
Figure 9. ROC curves of the NB, SVM, and LR classifiers for each disease type, i.e., Type1 and Type2 (TPR is True Positive Rate; FPR is False Positive Rate).
Table 1. Example terms of the medical-symptom-expression list used for collecting the element concepts in Verbstrong, Verbweak, Noun, Adj, and Adv sets of the segmented-term translation.

| Example Terms of Medical-Symptom-Expression List on Wikipedia Web Site | Medical-Symptom-Expression Term Translation from English to Thai by Lexitron Dictionary | Term/EDU Segmentation | Segmented-Term Translation from Thai to English by Lexitron Dictionary including WordNet |
|---|---|---|---|
| bleeding | เลือดออก/Leụ̄xd xxk | ((เลือด/Leụ̄xd)/Noun)/NP1 ((ออก/xxk)/Verbstrong)/VP | (Leụ̄xd/blood)/Noun (xxk/come out)/Verbstrong |
| belching (a gerund of a belch verb) | เรอ/Rex | (เรอ/Rex)/Verbstrong | (Rex/belch)/Verbstrong |
| fatigue | อ่อนแรง/X̀xn ræng | (อ่อนแรง/X̀xn ræng)/Verbstrong | (X̀xn ræng/be weak)/Verbstrong |
| malaise | อ่อนเพลีย/X̀xnphelīy | (อ่อนเพลีย/X̀xnphelīy)/Verbstrong | (X̀xnphelīy/be tired)/Verbstrong |
| flatulence | อาการท้องเฟ้อ/Xākār tĥxngfêx | ((อาการ/Xākār)/Noun (ท้องเฟ้อ/tĥxngfêx)/Adj)/NP | (Xākār/symptom)/Noun (tĥxngfêx/flatulence)/Adj |
| proctalgia | อาการปวดทวารหนัก/Xākār pwd thwār h̄nạk | ((อาการ/Xākār)/Noun (ปวด/pwd)/Verbstrong (ทวารหนัก/thwār h̄nạk)/Noun)/NP | (Xākār/symptom)/Noun (pwd/pain)/Verbstrong (thwār h̄nạk/anus)/Noun |
| tinnitus | อาการมีเสียงในหู/Xākār mīs̄eīyng nı h̄ū | ((อาการ/Xākār)/Noun (มีเสียง/mīs̄eīy ng)/Vinf (ใน/nı)/Preposition (หู/h̄ū)/Noun)/NP | (Xākār/symptom)/Noun (mīs̄eīy/have a ringing in)/Vinf (h̄ū/ear)/Noun |
| claudication | อาการปวดขา/Xākār pwd k̄hā | ((อาการ/Xākār)/Noun (ปวด/pwd)/Verbstrong (ขา/k̄hā)/Noun)/NP | (Xākār/symptom)/Noun (pwd/pain)/Verbstrong (k̄hā/leg)/Noun |
| chorea | ประสาทชักกระตุก/Pras̄āth chạk kratuk | ((ประสาท/Pras̄āth)/Noun)/NP1 ((ชักกระตุก/chạk kratuk)/Verbstrong)/VP | (Pras̄āth/nerve)/Noun (chạk kratuk/convulse)/Verbstrong |
| muscle weakness | กล้ามเนื้ออ่อนกำลัง/Kl̂ām neụ̄̂xx̀xn kảlạng | ((กล้ามเนื้อ/Kl̂ām neụ̄̂x)/Noun)/NP ((อ่อนกำลัง/x̀xn kảlạng)/Verbstrong)/VP | (Kl̂ām neụ̄̂x/muscle)/Noun (x̀xn kảlạng/weaken)/Verbstrong |
| anorexia | เบื่ออาหาร/Beụ̄̀x xāh̄ār | ((เบื่อ/Beụ̄̀x)/Verbstrong (อาหาร/xāh̄ār)/Noun)/VP | (Beụ̄̀x/be bored with)/Verbstrong (xāh̄ār/food)/Noun |
| arrhythmia | หัวใจเต้นผิดจังหวะ/H̄ạwcı tên p̄hid cạngh̄wa | ((หัวใจ/H̄ạwcı)/Noun)/NP1 ((เต้น/tên) (ผิดจังหว/p̄hid cạngh̄wa:)/Adv)/VP | (H̄ạwcı/heart)/Noun (tên/pulse)/Verbstrong (p̄hid cạngh̄wa/irregularly)/Adv |
| … | … | … | … |
Table 2. Probabilities of dt,i and wcSymt,ij with the CE-rel and nonCE-rel classes from Type1 and Type2 disease documents (where the values of the dt,i features and the wcSymt,ij features with the symptom-concept codes are d1,3 = HepatitisB, d1,7 = German-measles, d1,17 = TyphoidFever, d1,58 = Dysentery, d1,67 = Enteritis, d2,18 = diabetes, d2,41 = PulmonaryVascularDisease, d2,54 = Myocarditis, d2,63 = BrainDisease, d2,70 = Anemia, S6 = haveFever (person), S7 = haveHeadache (person), S10 = vomit (person), S28 = weaken (person), S31 = haveSweat (person)).

| dt,i (Type1) | CE-rel | nonCE-rel | wcSymt,ij | CE-rel | nonCE-rel |
|---|---|---|---|---|---|
| d1,3 | 0.0590 | 0.0601 | S6 | 0.0754 | 0.0237 |
| d1,7 | 0.0465 | 0.0474 | S7 | 0.0344 | 0.0227 |
| d1,17 | 0.0755 | 0.0769 | S10 | 0.0419 | 0.0350 |
| d1,58 | 0.0962 | 0.0980 | S28 | 0.0084 | 0.0435 |
| d1,67 | 0.0631 | 0.0643 | S31 | 0.0195 | 0.0047 |

| dt,i (Type2) | CE-rel | nonCE-rel | wcSymt,ij | CE-rel | nonCE-rel |
|---|---|---|---|---|---|
| d2,41 | 0.0203 | 0.0216 | S6 | 0.0247 | 0.0734 |
| d2,54 | 0.0365 | 0.0389 | S7 | 0.0247 | 0.0327 |
| d2,63 | 0.0365 | 0.0389 | S10 | 0.0277 | 0.0471 |
| d2,70 | 0.0397 | 0.0424 | S28 | 0.0398 | 0.0064 |
| d2,18 | 0.0397 | 0.0424 | S31 | 0.0068 | 0.0199 |
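As a worked illustration of how NB could use probabilities like those in Table 2, the sketch below multiplies the disease and symptom probabilities of each class and picks the larger product. It reads the table entries as class-conditional probabilities and assumes equal class priors (the learning corpus is described as balanced); both points are assumptions of this sketch and may differ from the Weka model actually used.

```python
# Probabilities copied from the Type1 rows of Table 2.
P_DISEASE = {
    "d1,3": {"CErel": 0.0590, "nonCErel": 0.0601},
    "d1,7": {"CErel": 0.0465, "nonCErel": 0.0474},
}
P_SYMPTOM = {
    "S6": {"CErel": 0.0754, "nonCErel": 0.0237},
    "S28": {"CErel": 0.0084, "nonCErel": 0.0435},
}
# Assumption of this sketch: equal class priors for a balanced corpus.
PRIORS = {"CErel": 0.5, "nonCErel": 0.5}

def nb_scores(disease, symptom):
    """Unnormalized Naive Bayes score of each class for one pair."""
    return {
        label: PRIORS[label] * P_DISEASE[disease][label] * P_SYMPTOM[symptom][label]
        for label in PRIORS
    }

def classify(disease, symptom):
    """Return the class with the highest Naive Bayes score."""
    scores = nb_scores(disease, symptom)
    return max(scores, key=scores.get)

for pair in [("d1,3", "S6"), ("d1,3", "S28")]:
    print(pair, classify(*pair))
```

Under these assumptions the disease probabilities are nearly identical across the two classes, so the decision is driven almost entirely by the symptom probabilities.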
Table 3. Fgroupz with the feature loading weights of the sc feature elements, where the loading weights with an absolute value ≥ 0.4 (highlighted in yellow in the original) mark the sc feature elements of each group.

| Symptom | Fgroup18 | Fgroup19 | Fgroup20 |
|---|---|---|---|
| S82 | 0.937 | −0.017 | 0.067 |
| S112 | 0.614 | 0.036 | −0.020 |
| S3 | 0.573 | −0.041 | −0.366 |
| S136 | 0.479 | −0.030 | 0.098 |
| S71 | 0.479 | −0.077 | −0.071 |
| S64 | 0.000 | 0.913 | 0.044 |
| S14 | 0.007 | 0.657 | −0.007 |
| S33 | −0.036 | 0.636 | 0.035 |
| S31 | −0.030 | 0.484 | −0.052 |
| S52 | 0.006 | −0.012 | −0.873 |
| S51 | −0.023 | −0.030 | −0.858 |
| S28 | 0.079 | −0.087 | −0.478 |
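The grouping rule stated in the caption of Table 3 (a symptom belongs to a group when the absolute value of its loading weight is at least 0.4) can be applied mechanically to the tabulated values. The sketch below does that; the function and variable names are illustrative.

```python
FACTOR_GROUPS = ("Fgroup18", "Fgroup19", "Fgroup20")
THRESHOLD = 0.4

# Loading weights copied from Table 3, in the order of FACTOR_GROUPS.
LOADINGS = {
    "S82": (0.937, -0.017, 0.067),
    "S112": (0.614, 0.036, -0.020),
    "S3": (0.573, -0.041, -0.366),
    "S136": (0.479, -0.030, 0.098),
    "S71": (0.479, -0.077, -0.071),
    "S64": (0.000, 0.913, 0.044),
    "S14": (0.007, 0.657, -0.007),
    "S33": (-0.036, 0.636, 0.035),
    "S31": (-0.030, 0.484, -0.052),
    "S52": (0.006, -0.012, -0.873),
    "S51": (-0.023, -0.030, -0.858),
    "S28": (0.079, -0.087, -0.478),
}

def group_symptoms(loadings, threshold=THRESHOLD):
    """Assign each symptom to every factor group whose |loading| >= threshold."""
    groups = {name: [] for name in FACTOR_GROUPS}
    for symptom, weights in loadings.items():
        for name, weight in zip(FACTOR_GROUPS, weights):
            if abs(weight) >= threshold:
                groups[name].append(symptom)
    return groups

groups = group_symptoms(LOADINGS)
for name, members in groups.items():
    print(name, members)
```

Taking the absolute value matters for Fgroup20, whose members all carry negative loadings.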
Table 4. The di Factorz pairs, where the highest FactorScorez on each di is highlighted as a yellow cell in the original.

| Case: di | Factor17 | Factor18 | Factor19 | Factor20 | Factor21 |
|---|---|---|---|---|---|
| d8 | −0.2358 | −0.3685 | −0.3134 | −0.4716 | 1.2682 |
| d27 | −0.2358 | 2.2686 | 1.0790 | 0.3166 | −0.2318 |
| d32 | −0.2358 | 0.5180 | −0.3134 | −0.4716 | −0.2318 |
| d41 | −0.2358 | −0.3685 | −0.3134 | 3.0963 | −0.2318 |
| d42 | −0.2358 | 5.5921 | −0.3134 | −0.4716 | −0.2318 |
| d43 | −0.2358 | −0.3685 | −0.3134 | −0.4716 | −0.2318 |
| d44 | −0.2358 | −0.3685 | −0.3134 | 3.0963 | −0.2318 |
| d53 | −0.2358 | −0.3685 | 0.6168 | 3.0963 | −0.2318 |
| d54 | −0.2358 | 0.4446 | −0.3134 | 0.3166 | −0.2318 |
| d55 | −0.2358 | −0.3685 | 1.0790 | −0.4716 | −0.2318 |
| d56 | −0.2358 | 1.8883 | −0.3134 | 3.0963 | −0.2318 |
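Selecting the factor with the highest score for each disease, as the caption of Table 4 describes, amounts to a row-wise maximum. The sketch below applies it to the five factor columns reproduced in the table. For a row such as d43, whose scores are all negative in this excerpt, the highest score presumably lies in a column not shown, so the sketch returns None there; that cutoff is an assumption of this sketch and not a rule stated in the text.

```python
FACTORS = ("Factor17", "Factor18", "Factor19", "Factor20", "Factor21")

# Factor scores copied from Table 4, in the order of FACTORS.
FACTOR_SCORES = {
    "d8": (-0.2358, -0.3685, -0.3134, -0.4716, 1.2682),
    "d27": (-0.2358, 2.2686, 1.0790, 0.3166, -0.2318),
    "d32": (-0.2358, 0.5180, -0.3134, -0.4716, -0.2318),
    "d41": (-0.2358, -0.3685, -0.3134, 3.0963, -0.2318),
    "d42": (-0.2358, 5.5921, -0.3134, -0.4716, -0.2318),
    "d43": (-0.2358, -0.3685, -0.3134, -0.4716, -0.2318),
    "d44": (-0.2358, -0.3685, -0.3134, 3.0963, -0.2318),
    "d53": (-0.2358, -0.3685, 0.6168, 3.0963, -0.2318),
    "d54": (-0.2358, 0.4446, -0.3134, 0.3166, -0.2318),
    "d55": (-0.2358, -0.3685, 1.0790, -0.4716, -0.2318),
    "d56": (-0.2358, 1.8883, -0.3134, 3.0963, -0.2318),
}

def strongest_factor(scores, minimum=0.0):
    """Name of the factor with the highest score, or None if it is not above minimum."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return FACTORS[best] if scores[best] > minimum else None

# One candidate CErel edge per disease: disease -> factor (symptom group).
edges = {disease: strongest_factor(scores) for disease, scores in FACTOR_SCORES.items()}
for disease, factor in edges.items():
    print(disease, "->", factor)
```

Each resulting pair corresponds to one arrow from a disease node to a factor node, so diseases that share a factor, such as d41, d44, d53, and d56 with Factor20, end up connected to the same symptom group.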
Table 5. The wcSymt,ij Feature Determination from the first test sample based on Medical-Symptom-Expression List on Wikipedia.

| Disease Type Documents | Number of Different wcSymt,ij Features | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Type1 Documents (17 d1,i documents) | 76 | 0.892 | 0.791 | 0.839 |
| Type2 Documents (18 d2,i documents) | 115 | 0.887 | 0.754 | 0.815 |
Table 6. Determination of dt,i wcSymt,ij Pairs Having CErel from the second test sample.

| Classifier | Metric | Type1 (4700 EDUs) | Type2 (5300 EDUs) |
|---|---|---|---|
| NB | Precision | 0.750 | 0.742 |
| NB | Recall | 0.739 | 0.737 |
| NB | F1-score | 0.745 | 0.740 |
| SVM | Precision | 0.757 | 0.745 |
| SVM | Recall | 0.747 | 0.742 |
| SVM | F1-score | 0.752 | 0.744 |
| LR | Precision | 0.762 | 0.743 |
| LR | Recall | 0.759 | 0.742 |
| LR | F1-score | 0.761 | 0.743 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

