Enhanced disease-disease association with information enriched disease representation

Objective: Quantification of disease-disease association (DDA) enables the understanding of disease relationships for discovering disease progression and finding comorbidity. For effective DDA strength calculation, the main challenge is to integrate the various biomedical aspects of DDA so as to obtain an information-rich disease representation. Materials and Methods: An enhanced and integrated DDA framework is developed that integrates an enriched literature-based DDA representation with a concept-based DDA representation. The literature component of the proposed framework uses PubMed abstracts and consists of an improved neural network model that classifies DDAs for an enhanced literature-based DDA representation. Similarly, an ontology-based joint multi-source association embedding model is proposed in the ontology component, using Disease Ontology (DO), UMLS, insurance claims, clinical notes, etc. Results and Discussion: The obtained information-rich disease representation is evaluated on DDA datasets covering different aspects, such as Gene, Variant and Gene Ontology (GO), and on a human-rated benchmark dataset. The DDA scores calculated using the proposed method achieved a high correlation, mainly on the gene-based dataset. The quantified scores also showed a better correlation of 0.821 when evaluated on 213 human-rated disease pairs. In addition, the generated disease representation is shown to have a substantial effect on the correlation of DDA scores for different categories of disease pairs. Conclusion: The enhanced context and semantic DDA framework provides an enriched disease representation, resulting in highly correlated results with different DDA datasets. We have also presented the biological interpretation of disease pairs. The developed framework can also be used for deriving the strength of other biomedical associations.


Introduction
DDA is a key factor in understanding disease relationships such as comorbidity, which is essentially the co-occurrence of diseases among the same patients and plays an important role in health care for drug discovery [1] and better treatment planning. To meet this emerging need, several studies relating diseases have been carried out in the biomedical domain [1-4]. Suratanee and Plaimas [3] employed a network-based approach to calculate DDA strength, achieving an area under the curve (AUC) of 0.71. Zitnik et al. [1] predicted DDA relationships and found that about 66 disease classes have significantly high relationships, with p-value < 0.001. In another work [4], a disease similarity database tool was developed that performs a hypergeometric test to obtain p-values for different pairs of diseases. On the other hand, DDA relationships were analysed using a disease causality network. Further, the sorted potential association strengths were compared between the top and bottom groups of disease pairs, and 95% of the disease pairs were found in the upper group. Since one disease can lead to another in any patient, treating associated diseases is a great challenge for modern medicine. Hence, exploring DDA helps in gaining better insight into disease relationships, which helps clinicians in proper diagnosis and treatment.
For a better understanding of DDA, it is important to know the various underlying aspects through which diseases are associated. One such aspect considers biological entities such as other diseases [5], genes [4,6], pathways [7], drugs [8], and phenotypes [9] as intermediate factors facilitating indirect DD association. Another aspect revolves around the vast, established and heterogeneous biomedical databases, such as biomedical datasets including the Protein-Protein Interaction Network [4,10] and HumanNet [11], and biomedical ontologies like DO [12], GO [13], Human Phenotype Ontology (HPO) [14], Unified Medical Language System (UMLS) [15] and Medical Subject Headings (MeSH) [16]. On the other hand, connections between diseases can be inferred using biomedical text such as PubMed [17,18], MedLine [19], Clinical Notes, Claims Database and PubMed Central (PMC) [20], Electronic Health Records [21] and HealthMap Corpus [22]. In order to widen the range of components affecting disease associations, non-biomedical text such as Wikipedia [17,23] has also been considered.
In addition, measuring the strength of DDA helps to improve clinical decision making. As a quantitative measurement, disease similarity is generally used to indicate the extent to which diseases are associated, since similar diseases are usually caused by similar semantic aspects such as similar etiology, markers, mechanisms, patterns, etc. In this regard, using a single biological source, the strength of disease associations is computed by IC-based methods such as those of Wang et al. [24], Resnik [25] and Lin [26], which rely solely on the semantic associations of ontologies such as MeSH, DO and HPO. Taking advantage of biological process terms, some statistical approaches have been proposed. Mathur and Dinakarpandian [27] calculated the association strength from the overlapping genes of diseases using GO. In another work, the association of diseases is computed using both the information content and the co-occurrence of terms in the ontology [28]. Recently, some research has employed a neural network approach, namely a word embedding model, to learn ontological node vector representations that are used to associate diseases through similarity values [29]. Apart from ontologies, DDAs can also be quantified by mining a large corpus of biomedical literature. In the context of text, O'Shea [18] used a network-based shortest-path distance method to calculate the relatedness between diseases from the occurrence frequency of disease terms. Alternatively, using a neural network-based approach, Beam et al. [20] derived distributional vector representations from clinical notes, insurance claims and journal articles, and projected the learned context-based concept vector representations onto a distributional space for relatedness computation. Therefore, in general, either semantic aspects or concept-based aspects have been considered for the calculation of DDA strength. However, considering both aspects could lead to a more effective strength calculation.
Some efforts have been put forth to combine biomedical knowledge from various sources to derive representations of biomedical concepts for measuring the relatedness of the concepts. There are works that fused various kinds of biomedical knowledge such as biomedical entities, biomedical datasets and ontologies [30,31]. On the other hand, with the growing biomedical literature, some work has attempted to compute the relatedness of biomedical concepts with integrated vector representations mined from both literature and semantic ontological information [32,33]. However, the integrated vector encoded only limited aspects of contextual relations from literature and semantic relations from ontology. Hence, in this paper, an integrated vector is derived that covers a wide range of both contextual and semantic relations for an effective DDA strength calculation.
The paper is organized as follows: Section 2 briefly reviews the state-of-the-art methods related to biomedical association classification and strength computation. Subsequently, the datasets used in this work and the proposed DDA framework are described in detail in Sections 3 and 4, respectively. Section 5 presents the experimental results that evaluate the quantified DDA scores obtained using the proposed framework. Finally, conclusions are drawn in Section 6.

Literature-based approaches
Biomedical literature contains associations linking diseases with other diseases. Given their significance in health-oriented applications, it is imperative to investigate these digitized data and extract the type of association using a text mining approach. Given a sentence and a disease pair appearing within the sentence, the DDA can be one of three types: positive association, where there is an explicit mention of association through words like association, comorbidity factors, complication, risk factors, etc.; negative association, in which a negative word explicitly conveys that no relation exists between the two disease mentions; and neutral or null association, which does not state any association between the co-occurring diseases. Towards this end, a number of literature-based methods have been proposed for the extraction of associations between different biomedical entities [17,34-37].
The co-occurrence statistical technique assumes that the more frequently entities occur together within an abstract or sentence, the higher the chance that they are positively associated [8]. Li et al. [38] employed co-occurrence statistics to detect disease-related associations. Rosário-Ferreira et al. [47] considered diseases to be related if they are co-mentioned in the abstract text. However, entities occurring together may not be semantically connected, which results in low precision [39][40][41].
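The co-occurrence idea described above can be sketched as follows. This is a minimal illustration, not the method of any cited work: the function name `cooccurrence_counts`, the example sentences and the naive substring matching (standing in for a real disease tagger) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, disease_terms):
    """Count how often each unordered disease pair is mentioned in one sentence."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Assumption: naive substring matching replaces a real entity recognizer.
        found = sorted(d for d in disease_terms if d in text)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Hypothetical example sentences, written only for this sketch.
sentences = [
    "Obesity is a risk factor for diabetes.",
    "Diabetes and obesity often occur together.",
    "No link between asthma and diabetes was observed.",
]
counts = cooccurrence_counts(sentences, {"obesity", "diabetes", "asthma"})
print(counts)
```

Note that the third sentence states a negative association, yet the pair is still counted. This is the kind of case behind the low precision mentioned above.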
Manually or automatically formulated rules also play a role in the association extraction task. Lee et al. [42] and Song et al. [43] manually drafted a number of rules for protein-protein interaction (PPI) and disease-gene relation extraction, respectively. In addition, Tari et al. [44] used automatically created rules to identify biomedical relations from MEDLINE abstracts. The major limitation of a rule-based system is that it is difficult to create rules covering all types of associations; moreover, creating such rules requires deep insight into biomedical knowledge.
However, with the huge set of annotated training text available for biomedical associations, a machine learning approach can overcome the above limitations through its ability to learn relation patterns of sentences, which can then be used to automatically detect the association type in unseen texts. Bhasuran and Natarajan [45] used a supervised machine learning method for gene-disease association extraction, which required a large training set and was time-consuming. Zhang and Lu [46] and Rosário-Ferreira et al. [47] eliminated this deficiency by using a semi-supervised method that utilized a small training set to learn DDA patterns from PubMed abstracts. However, machine learning (ML)-based methods require enormous manual effort in designing biomedical relation features for the association extraction task, as ML methods lack automatic feature extraction.
These issues were addressed by employing deep neural networks for efficient feature engineering in text mining to curate a number of biomedical relation types, as they involve an automatic feature learning process [35,48-50]. One of the popular deep neural network models, the Convolutional Neural Network (CNN), has been widely used for classifying whether sentences contain positive, negative or null associations between biomedical entities using a sentence representation, where different representations of various local-level features captured at sentence level and global-level features captured at corpus level were used for classification [17,34,37].
A Multi-Channel Dependency-based CNN classified PPIs into positive and negative associations, where the sentence representation covered word embeddings trained only on global-level features from PubMed and PMC [35]. Using additional embeddings from Wikipedia and MEDLINE, the Multi-Channel CNN (MCCNN) model classified DDI and PPI into positive associations, such as effect, mechanism, etc., and negative associations. Attempts were also made to classify different biomedical associations: gene-disease associations (GDAs) [34], using disease position as the only local-level feature; DDAs [17], using Parts-of-Speech (POS) as an additional feature; and spice-disease associations [37], using POS and chunk tags as additional local-level features.
However, only a limited number of local-level and global-level features were used in sentence representation for the sentence-level classification of biomedical associations into positive, negative and null.
Similar research considering local and global text and video features has been carried out by Wang et al. [51] for video-text retrieval. In the text part, they considered only the encoded full-text representation as the global text feature, while the decoded global representation is extracted as the local text feature. In neither case are the various local-level features or the global-level features of each word in the given text embedded.
Moreover, most of the above work only classified associations and did not attempt to calculate the association strength. Attempts were made to calculate only the strength of positively correlated pairs using statistical [18] and pattern-based approaches [52]. While literature-based approaches have mainly been used for the classification of biomedical associations, a concept-based approach is needed for effective association strength calculation.

Concept-based approaches
Biomedical ontologies have integrated non-duplicative biomedical concept terms and medical data, providing a high coverage of biomedical concept terms, which have been used to compute the semantic association strength between biomedical entities. Quantitative semantic association among diseases helps clinicians gain a better knowledge of diseases, since semantically associated diseases reveal similar or common underlying attributes that further help in proper treatment planning [31]. Therefore, discovering quantitative semantic biomedical associations using biomedical ontologies plays a crucial role in the biomedical field [11,31]. Some work has encoded conceptual sources for computing semantic associations. Wei et al. [53], Beam et al. [20] and Pakhomov et al. [54] used only unstructured corpora such as insurance claims, clinical notes, etc., to include the conceptual aspects in the association computation, while Wei et al. [53] exploited the ontology only to retrieve disease concepts. With additional information on semantic relation types, Yu et al. [33] attempted to associate biological entities with improved semantics. However, the taxonomic relationships conveyed by ontologies are needed for an enhanced quantification of semantic association.
Most ontology-based methods are node-based, edge/path-based or hybrid. Node-based approaches use properties of the node, such as Information Content (IC) [25,55] and its variants [56][57][58], to compute the semantic association between concepts based on their lowest common ancestor. However, the computation of IC values is based on an annotated corpus and is hence corpus-dependent. On the other hand, the edge/path-based approach uses the number of edges between the given concepts to measure the association. One such method, proposed by Wu and Palmer [59], used the common path from the root node to the least common ancestor node, while Richardson et al. [60] used an edge-weight technique based on node density, depth and connections between parent-child nodes to compute the conceptual associations. Further, Cheng et al. [61] proposed a weighted maximum common ancestor depth and Wu et al. [62] proposed a non-weighted maximum common ancestor depth to measure semantic associations. Using the topology of DO, Wang et al. [63] calculated the strength of association by considering the semantic impact of ancestors on the entities involved in the association. However, the problem with edge-based measures is that concepts at the same depth are not semantically well differentiated. As a hybrid measure, Mazandu and Mulder [64] used the topological positional characteristics of the GO for association strength calculation. Zhao and Wang [58] computed relatedness using the count of children nodes and the topology of GO. Kamran and Naveed [65] also exploited the topology of GO along with common descendants to calculate the strength of associations. However, the computation of semantic relatedness using hybrid methods has not incorporated the semantic meaning of the concepts captured within the ontology.
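As an illustration of the edge/path-based family, the sketch below computes the widely used Wu-Palmer form, 2 * depth(lca) / (depth(a) + depth(b)), on a toy taxonomy. The `parent` map and its disease names are hypothetical and chosen only for the example, and a single-parent tree is assumed for simplicity, whereas real ontologies such as DO allow multiple parents.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def depth(node, parent):
    # Convention assumed here: the root has depth 1.
    return len(path_to_root(node, parent))


def wu_palmer(a, b, parent):
    """Wu-Palmer similarity of two nodes in a single-parent taxonomy."""
    ancestors_b = set(path_to_root(b, parent))
    # The first shared node on the way up from `a` is the least common ancestor.
    lca = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    return 2 * depth(lca, parent) / (depth(a, parent) + depth(b, parent))


# Hypothetical toy taxonomy: child -> parent.
parent = {
    "metabolic disease": "disease",
    "diabetes mellitus": "metabolic disease",
    "type 2 diabetes": "diabetes mellitus",
    "obesity": "metabolic disease",
    "asthma": "disease",
}

print(wu_palmer("type 2 diabetes", "obesity", parent))  # 2*2 / (4+3)
print(wu_palmer("obesity", "asthma", parent))           # 2*1 / (3+2)
```

Because the score depends only on depths, any two concepts that share the same depth and the same least common ancestor receive the same value, which reflects the differentiation problem noted above.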
Semantic associations based on the semantic meaning of concepts can also be computed using vectors learnt from the ontological graph structure. Camacho-Collados et al. [66] used graph-based vectors to compute the semantic association, where the vector representation is based solely on the structure of the graph. Guo et al. [67] and Zhong et al. [68] used graph embeddings, which can capture the structural information connecting nodes in the graph, but no relationship information was considered. Smaili et al. [69] represented each concept by aggregated embeddings, trained on a general corpus, of all its annotated nodes including the ancestors, with no control over the amount of ancestral information affecting the given concept. Hence, the problem with vector-based association is that the vector representation encodes only limited ontological relationship information, without any control over the contribution of the entities involved in the association.

Integrated approaches
Attempts have been made to measure the association between diseases by integrating multiple data sources as well as fusing the details of various biological entities extracted from these biomedical sources. Su et al. [31] developed a joint association method combining biological entities such as genes and phenotypes and integrating ontological sources (DO, HPO), where semantic associations determine the disease associations. Similarly, Cheng et al. [30] spanned different biomedical sources (DO, HumanNet), fusing functional and semantic associations to measure the association strength. With the unprecedented growth of biomedical literature, there has been a significant gap between the increasing published scientific knowledge and the tailored biological data knowledge [70]. Hence, it is necessary to integrate the contextual knowledge obtained from biomedical literature with the semantic knowledge of biological data sources for the DDA task. Deng et al. [71] used a biological-process-based approach, integrating both literature and ontology (GO), and proposed a combined score of semantic and contextual associations using symptoms, genes and their related functions. In addition, Li et al. [72] proposed a relatedness method integrating contextual and functional associations mined from literature (MedLine) and a biomedical network (PPI), respectively. Moreover, Jiang et al. [32] proposed a hybrid semantic embedding model incorporating corpus-based distributional representations into multiple ontologies to gain a better similarity score for biomedical concepts. Similarly, Yu et al. [33] used a neural network approach to induce the vector representation of biomedical concepts by retrofitting contextual information from literature (PubMed) using semantic information from ontology (UMLS), such that the resulting vectors can be utilized to measure the association strength. However, both Jiang et al. [32] and Yu et al. [33] generated the corpus-based representation for each concept independently, without considering the different types of context (association) of the sentences. On the other hand, the ontological knowledge integrated by Jiang et al. [32] was only the edge-based semantic similarity of concept pairs, which did not incorporate the semantic meaning of concepts or their ontological relationship connections. In addition, the existing methods associate biomedical concepts (entities) using only limited aspects of contextual and semantic relations, which results in low correlation with human-judged association scores.
Thus, for biomedical association quantification from literature, particularly DDA, the existing classification models have used only a limited number of local-level and global-level features, which capture only limited syntactic, semantic and contextual features for sentence representation learning. Hence, in order to improve the classification performance, there is a need to include additional local-level and global-level features. The existing methods either did not calculate association strengths or calculated only positive ones. However, it is important to quantify the strength of DDA pairs based on all types of DDA pairs, namely positively, negatively and null associated pairs, using sentence embeddings under different contexts.
Similarly, for concept-based quantification of DDA, existing methods embedded concepts by considering only the connectivity of concepts in the ontology. The semantic meaning of concepts and the various ontological relationships affecting the associations are not embedded. In addition, all ancestors are treated equally. However, controlling the impact of ancestral embedding is important, as each ancestor may be either closely or distantly related to each concept in the association.
The integrated approaches fusing literature and ontology did not consider the different context types of sentences from literature and did not incorporate the multiple semantic meanings of concepts with ontological relationships. Moreover, the existing methods have fused only limited semantic-type relations from ontology with limited contextual relations from literature. However, the association varies based on the type of taxonomical connection relationship that exists in the ontology. Therefore, there is a need to integrate both contextual relations from literature and richer semantic relationships from ontologies for an enhanced DDA strength quantification.
Of significance, while there are existing association quantification methods that have fused semantic relations from ontology with contextual relations from literature, we improve the association quantification in this paper as follows:
1) We enhance the literature-based DDA representation by considering all context types of association sentences (positive, negative and null) with an improved sentence representation.
2) We also enhance the concept-based DDA representation through the proposed ontology-based joint multi-source association representation, in which the semantic meaning of concepts and the various ontological relationship connections are incorporated for a better DDA quantification.
3) We present an enhanced and integrated DDA framework that widens the coverage of the various relationship aspects of association components, both contextually (A) and conceptually, i.e. semantically (B), to build an information-enriched disease vector representation.

Collection of unlabeled PubMed abstracts
We initially used the available, already annotated dataset of 521 abstracts [17] for training the proposed ESEC-CNN model. However, in order to achieve better modelling, we expanded this dataset. To assist the DDA dataset expansion, an initial set of approximately 3 million bio-concept-annotated, disease-related PubMed abstracts was extracted using PubTator. PubTator, an automatic text-mining tool, recognizes various biomedical entities such as genes/proteins, diseases, genetic variants, species and chemicals in the titles and abstracts of PubMed articles [73]. To ensure sentence-based DDA, only the 39,510 abstracts with at least one DDA sentence are retained for further processing.

Disease ontology
DO, a taxonomy of diseases in which each disease term is linked to another hierarchically through the semantic type "is_a" association, has been used [12]. DO, which maps each disease term to its disease ID (DOID) along with the term definition and the human disease-related knowledge base, was downloaded from http://purl.obolibrary.org/obo/doid/releases/2022-06-07/doid.owl (accessed 7 June 2022). In this work, the conceptual linking of diseases for concept-based DDA has been established using various DO relationships. Approximately 8000 of the 14,958 diseases in the enhanced dataset were mapped to DO, and their corresponding term definitions are further utilized in concept embedding.

Unified Medical Language System
The UMLS consists of three components, the Metathesaurus, the Semantic Network and the Lexicon tools, and contains concepts with a concept ID (CUI), definitions and linkages to other concepts through semantic relations such as CHD "Child", SY "asserted synonymy", RN "has a narrower relationship", RO "has other relationship", RQ "related and possibly synonymous", etc. In this work, only the Metathesaurus concepts file, which contains the relationships of concept pairs, is used for concept embedding in concept-based DDA [15].

Datasets for evaluation
We evaluate the DDA scores obtained by our approach against the results of DisGeNET, which contains about 1,048,575 DD pairs from a curated DDA database. DisGeNET defines DDAs based on shared genes and variants among the available gene-disease associations [74]. This well-known database has been used for direct comparison of DDA strengths from both perspectives. Rosário-Ferreira et al. [47] used DisGeNET to evaluate the DDAs obtained using SicknessMiner. The phenotypic similarity of diseases was also evaluated using DisGeNET scores for inborn errors of immunity [75]. Further, we created a standard dataset to compute DDA strength using functional GO as the association criterion. The disease-related GOs are obtained from CTD. The attributes of the datasets include disease1, disease2 and the Jaccard similarity scores computed using genes, variants and GOs. In this work, we have adopted DisGeNET as well as the created standard dataset for evaluating DDA strength.
In addition, the DDA strength obtained by our approach is also evaluated using human-rated DDA pairs. Hence, a combined standard DDA dataset with human-assessed scores is created using 213 disease-disease pairs obtained from UMNSRS [54] and MayoSRS [76], by mapping the concept terms to disease terms using the CTD disease vocabulary [77].
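The Jaccard similarity used as a reference score in the evaluation datasets above can be illustrated with a short sketch. The disease names and gene identifiers below are placeholders invented for the example, and the exact scoring pipeline of DisGeNET may differ in detail; the same function applies unchanged to sets of variants or GO terms.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical disease-to-gene annotations (placeholder identifiers).
disease_genes = {
    "disease_A": {"G1", "G2", "G3"},
    "disease_B": {"G2", "G3", "G4"},
    "disease_C": {"G5"},
}

score_ab = jaccard(disease_genes["disease_A"], disease_genes["disease_B"])
score_ac = jaccard(disease_genes["disease_A"], disease_genes["disease_C"])
print(score_ab, score_ac)
```

Here `disease_A` and `disease_B` share two of four distinct genes, while `disease_A` and `disease_C` share none.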

Integrated and enhanced DDA strength quantification framework
The proposed work measures the association strength between different diseases by integrating various types of contextual and conceptual relations that link diseases. In this work, contextual relationships are obtained from biomedical literature, namely PubMed abstracts. Similarly, biomedical databases (DO [12], UMLS [15]) and biomedical text (Clinical Notes, Insurance Claims Database, Journal Articles) [20] are utilized to obtain conceptual relations. It is important both to derive DDAs by integrating the multiple linking perspectives that associate a given disease pair and to compute the aggregated DDA strength.
Figure 1 describes the proposed framework. With the list of diseases as the main input, the collection of associated PubMed abstracts is the first step. In Section 4.1, the proposed deep neural network model, Enhanced Sentence Embedding with Context-Based CNN (ESEC-CNN), is trained on 521 preprocessed and labelled (positive, negative and null DD pairs) PubMed abstracts [17]. The built model is further exploited to classify a new set of PubMed abstracts collected iteratively. This dataset is used to improve the general performance of DDA prediction. The set of classified DDAs and the sentence embeddings obtained from the enhanced dataset are further utilized to construct literature-based DDA matrices. In addition, the enhanced list of diseases is also used for the construction of the concept-based DDA matrix of DDA representations, as described in Section 4.2. Using the biomedical text and biomedical databases, an ontology-based joint multi-source association embedding model is proposed to improve concept-based DDA. The integration of literature-based and concept-based DDAs for DD association enhancement is described in Section 4.3, using a modified vector-similarity fusion method [78] to improve the quality of the integrated disease vector. Finally, the relatedness score between DDs is calculated using the cosine similarity of the integrated disease vectors [79]. The DDA dataset derived from the initial 521 labelled abstracts is used for the construction of enhanced literature-based DDA matrices using sentences with disease pairs classified into positive, negative and null pairs. For this classification, we propose a neural network architecture as illustrated in Figure 2.
The network is designed to capture syntactic and semantic information for a given sentence with DD pairs from three different perspectives:
1) Sentence-based local-level features: At sentence level, we use the Parts-of-Speech (POS) feature with a one-hot encoding scheme represented by an 11-bit binary vector [35] and a two-dimensional disease distance feature [17]. For DDA, new additional features such as dependency relations [80] and chunk [81] are included, and a Named Disease Entity (NDE) feature is obtained, similar to the work of Peng and Lu [35]. The NDE feature is applied to each word in a sentence and is represented by a four-dimensional encoding < D1, D2, D, O >, where D1 and D2 represent the disease pair under consideration. Other disease words and non-disease words are represented by D and O, respectively.
2) Sentence-based global-level features: Using the popular embedding model word2vec [82], the embedding of each word in a sentence is learnt at corpus level using both domain-specific contexts such as PubMed and PMC and general contexts including news, in addition to Wikipedia [83].
3) Document-level features: Similar to the work of Lai et al. [17], the traditional document features such as Bag-Of-Words, word-based Parts of Speech, NDE information and document-based information are represented using one-hot encoding.
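The four-dimensional NDE encoding < D1, D2, D, O > described in item 1) can be sketched as a per-token one-hot vector. This is only an illustration under stated assumptions: the function name `nde_encode`, the example sentence and the simplification that every disease mention is a single token are not taken from the original work.

```python
NDE_TAGS = ("D1", "D2", "D", "O")


def nde_encode(tokens, d1_index, d2_index, other_disease_indices=()):
    """Return one 4-dimensional one-hot vector <D1, D2, D, O> per token."""
    others = set(other_disease_indices)
    encoded = []
    for i in range(len(tokens)):
        if i == d1_index:
            tag = "D1"          # first disease of the pair under consideration
        elif i == d2_index:
            tag = "D2"          # second disease of the pair under consideration
        elif i in others:
            tag = "D"           # any other disease mention
        else:
            tag = "O"           # non-disease word
        encoded.append([1 if t == tag else 0 for t in NDE_TAGS])
    return encoded


# Hypothetical tokenized sentence; diseases are assumed to be single tokens.
tokens = ["obesity", "increases", "risk", "of", "diabetes", "and", "hypertension"]
features = nde_encode(tokens, d1_index=0, d2_index=4, other_disease_indices=[6])
for token, vec in zip(tokens, features):
    print(token, vec)
```

In a full pipeline these vectors would be concatenated per token with the other local-level features (POS, distance, dependency, chunk) and the word embedding before the convolution layer.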
Thus, in this work, an enhanced sentence embedding with additional features is framed, which helps the proposed classification model to better classify the different types of association. In Figure 2, the input to ESEC-CNN is the embedding layer representing the sentence, followed by convolution and pooling layers that output an n-dimensional enhanced sentence embedding vector. The m-dimensional document representation [17] is then merged with the enhanced sentence embedding to create a single final vector of (n + m) dimensions. A fully connected layer with categorical hinge loss [84] is applied to the merged vector. The combined vector is further passed on to a three-dimensional output layer representing the probabilities of the classes: positive, negative and null.
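The merge step and the loss mentioned above can be sketched in a few lines. The sketch assumes the common definition of categorical hinge loss, max(0, 1 + max over wrong classes of the score - score of the true class), which may differ in detail from the variant used in the original work; the vectors and scores are made-up values, and `merge` and `categorical_hinge` are hypothetical helper names.

```python
def merge(sentence_vec, document_vec):
    """Concatenate an n-dim sentence vector and an m-dim document vector."""
    return list(sentence_vec) + list(document_vec)


def categorical_hinge(y_true, y_pred):
    """Categorical hinge loss for a one-hot target and a vector of class scores."""
    pos = sum(t * p for t, p in zip(y_true, y_pred))        # score of the true class
    neg = max((1 - t) * p for t, p in zip(y_true, y_pred))  # best wrong-class score
    return max(0.0, neg - pos + 1.0)


# Made-up example: n = 3 sentence dimensions, m = 2 document dimensions.
merged = merge([0.2, -0.1, 0.7], [1.0, 0.0])
print(len(merged))  # n + m

# Classes are ordered (positive, negative, null); the target is "positive".
loss_uncertain = categorical_hinge([1, 0, 0], [0.7, 0.2, 0.1])
loss_confident = categorical_hinge([1, 0, 0], [2.0, 0.5, -1.0])
print(loss_uncertain, loss_confident)
```

The loss is zero once the true class outscores every other class by a margin of at least 1, so only insufficiently separated predictions contribute to training.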

Enhanced construction of literature-based DDA matrices
The trained classifier model is utilized in our work to classify the new set of extracted PubMed abstracts. In order to improve the performance of DDA strength calculation, it is essential to widen the range of positive, negative and null contexts of DD pairs, thereby aggregating the contribution of contextual information to the DD strength during the construction of the enhanced literature-based DDA matrix. Further, the number of seed diseases is also increased, so that we attempt to measure the strength of association between a larger number of DD pairs. The dataset is constructed by an iterative technique, starting with 213 seed DD pairs collected from combined benchmark datasets including UMNSRS Similarity and Relatedness [54] and MayoSRS and MiniMayoSRS between medical term pairs [76], until we obtain 58,980 unique DD pairs.
In order to effectively quantify DDA strength using literature, it is important to consider positive, negative and null associations, as each type conveys a different degree of association. Hence, the DDA classes (positive, negative and null) predicted by the LC-CNN model, along with the improved sentence representations, are further utilized to construct two literature-based DDA matrices, namely the literature-based positive, negative DDA matrix of DDA representations and the literature-based null similarity matrix.
1) Literature-based positive, negative DDA matrix
As discussed in Section 2.1, sentence-based biomedical associations are classified into only positive and negative [17,34-37] or only as negative [36]. During the strength calculation, O'Shea [18] and Xu et al. [52] considered only positively correlated pairs. However, it is important to calculate the strength of association of pairs that occur in both positive and negative contexts and of those that occur only in negative contexts. Considering the above aspects, the cumulative association strength is calculated in Eq (1).
where the terms of Eq (1) are the association vector of the disease pair, the numbers of positive and negative contexts, and the enhanced sentence representations with two disease mentions in the positive and negative cases, respectively. The association strength of the disease pair is handled differently depending on which of three cases it falls into. Eq (1) combines case 1 and case 2 using an association modification factor (-) that modifies the association strength if the pair occurs in both positive and negative contexts.
2) Literature-based null similarity matrix
Although Rakhi et al. [37] classified sentence-based biomedical entity pairs as null, these associations were not considered while calculating the strength of association. However, null pairs with unmentioned associations may also be associated with some strength and hence need to be taken into consideration. In addition, in this work, we extend the concept of null association from within the same sentence [17,34,37] to across different sentences having a single disease mention; including the corresponding embedding information therefore also contributes to DDA strength computation. Accordingly, we derive Eq (2), representing a disease vector.
where the left-hand side of Eq (2) denotes the vector representation of a disease, and the summed terms denote the two-disease-mention and single-disease-mention enhanced sentence representations.
The disease vector consists of 2 important components in the context of DDA. The first sum accumulates the enhanced sentence representations of the disease when it occurs in the same sentence with all other unmentioned or null associated diseases. The second sum accumulates the enhanced sentence representations of the disease when it occurs as a single disease mention in sentences. The vector of the second disease in the pair is calculated in the same way, and the null association strength of the pair is calculated as the cosine similarity of the two disease vectors, which helps modify DDA with null associations and discover DDAs that are not directly conveyed by positive/negative associations. Using Eqs (1) and (2) described in 1) and 2), we are able to construct an enhanced literature-based positive, negative DDA matrix and a literature-based null similarity matrix, shown in Figure 3, that are later used to calculate literature-based DDA strength.
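The construction of a disease vector from its two accumulated components, followed by the cosine-based null similarity, can be sketched as below. Because Eq (2) itself is not reproduced here, the sketch assumes the two sums are added element-wise; the function names and toy sentence embeddings are hypothetical.

```python
import math


def vector_sum(vectors, dim):
    """Element-wise sum of a list of equally sized vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def disease_vector(null_pair_sentences, single_mention_sentences, dim):
    """Combine the two components described in the text (assumed: added element-wise).

    null_pair_sentences: sentence embeddings where the disease co-occurs
        with null-associated diseases.
    single_mention_sentences: sentence embeddings where the disease is the
        only disease mentioned.
    """
    a = vector_sum(null_pair_sentences, dim)
    b = vector_sum(single_mention_sentences, dim)
    return [x + y for x, y in zip(a, b)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional sentence embeddings (hypothetical values)
d1 = disease_vector([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], dim=2)  # [2.0, 2.0]
d2 = disease_vector([[1.0, 0.0]], [[2.0, 0.0]], dim=2)              # [3.0, 0.0]
null_similarity = cosine(d1, d2)
```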

Concept-based DDA using ontology-based joint multi-source embedding model
To integrate conceptual aspects into DDA calculation, a detailed ontological mapping covering a wide range of taxonomic relationships plays a vital role and contributes to the quantification of semantic associations between diseases. The taxonomic relationships include the ancestorial parent-child relationship as well as sibling and indirect relationships (uncle, cousin). Wang et al. [63] did not consider the semantic relationship in disease association measurement, while only the parent-child relationship is considered in the prediction of the onset of diseases [85,86]. For DDA, in this work, we consider ancestorial and other closely related taxonomic relationships to derive a better degree of association linking diseases. Given DO as a directed acyclic graph (DAG) whose nodes correspond to the ancestors and the two disease concepts involved in the association, the ancestorial relationship and the ontological relationship connection between disease concept 1 and disease concept 2 are used to learn the association representation.
For DDA measurement, each disease (concept) must be embedded in relation to a disease pair. For this, the connectedness of concepts [68] and the semantic information of all ancestors are used [29,85,86]. However, it is important to discover new ancestors sets prior to association representation, as not all ancestors contribute to the final association.
After discovering the ancestors sets, we introduce a 2-stage DDA quantification, ontology-based joint multi-source association representation, shown in

Discovery of new ancestors sets
As discussed in Section 4.2, including all ancestors of a given disease concept may introduce semantic contributions from concepts that are not common to the two diseases in the pair; the resulting disease embedding may therefore lead to an incorrect association. To tackle this, rather than considering all ancestors of a particular node in the ontology, we consider only those ancestors that contribute to the association between the diseases, by defining a new set of ancestors for each of the two diseases in the association. The derived ancestors set of a disease in the association is described in Eq (3), where only common ancestors are considered, since two diseases are associated by sharing common diseases in the DO. In addition, the ancestors on the longest path with respect to that disease are also considered, to cover a broader etiology of the disease concept.
where the common-ancestors term denotes the ancestors shared by the two diseases of the pair.
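As a rough illustration of the ancestors-set discovery described above, the following sketch takes the union of the common ancestors of a pair and the ancestors on the longest root-to-disease path. The toy ontology, the node names and the decision to exclude the disease itself from its own set are assumptions for the example, not details taken from Eq (3).

```python
def ancestors(node, parents):
    """All ancestors of a node in a DAG given as a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen


def longest_path_to_root(node, parents):
    """Longest path from a root down to the node, returned as [root, ..., node]."""
    best = [node]
    for p in parents.get(node, []):
        candidate = longest_path_to_root(p, parents) + [node]
        if len(candidate) > len(best):
            best = candidate
    return best


def new_ancestors_set(disease, other, parents):
    """Common ancestors of the pair plus ancestors on the disease's longest path."""
    common = ancestors(disease, parents) & ancestors(other, parents)
    on_longest_path = set(longest_path_to_root(disease, parents)[:-1])
    return common | on_longest_path


# Hypothetical toy ontology (child -> list of parents)
toy_parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D1": ["C", "B"],  # D1 has two parents; the path through C is longer
    "D2": ["A"],
}

set_d1 = new_ancestors_set("D1", "D2", toy_parents)
set_d2 = new_ancestors_set("D2", "D1", toy_parents)
```

In this toy graph, node B is an ancestor of D1 but is neither shared with D2 nor on D1's longest path, so it is left out of D1's set.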
Further, by utilizing the discovered ancestors sets, ontology-based joint multi-source association embedding model is proposed, consisting of 2 stages, described in sub-sections 4.2.2 and 4.2.3.

Ontology-based joint multi-source association embedding model
Stage-1 Novel-ancestorial level-based DDA quantification using multi-source embeddings
Figure 5 shows the derived embedded association representation for two disease nodes in the given DO, where the representation is divided into two components for each of the two diseases: A) multi-source ancestorial embedding and B) novel ancestorial level-weight, discussed in the following.
A disease association may be influenced by several factors such as symptoms, biological entities (genes, proteins, etc.), other diseases and affected patient records, which can be covered by infusing embeddings from different sources. In addition, considering multi-source information of the root node and the leaf node is important in the context of DDA, as the root node is common to both diseases and the leaf node is involved in the association. As shown in Figure 5, the multi-source ancestorial embedding of each ancestor in the new ancestors set is given by component A, in which we assign multi-source contextual embeddings from DO and biomedical text [3] and a semantic embedding from UMLS [33]. For embedding the text definition from DO, we adopted the procedure used by Park et al. [23] to fill in the definition of diseases using the first lead paragraph from Wikipedia, applying the embedding method Doc2Vec [87]. The combined semantic and contextual information is then infused into the deep neural network embedding model through an attention mechanism [85,86].
The next component of stage-1 controls the semantic and contextual contribution of each ancestor by adding a level-weight to the aggregated multi-source embeddings obtained using component A. We used ancestorial level-weights similar to Wang et al. [63] (relative positions in MeSH) and Kamran and Naveed [65]. Wang et al. [63] and Kamran and Naveed [65] calculated the ancestorial level-weight by choosing the maximum of the level-weights among all children of the ancestor with respect to each entity in the association. This may assign an ancestor a level-weight through children that are neither common ancestors nor on the longest path to either disease, thus failing to reflect the level-weights of the nodes that contribute to the association. Selecting instead the level-weight contributed by children that are common ancestors or that fall on the longest path with respect to each disease, that is, children in the new ancestors sets of the pair, reveals the actual semantic value or level-weight of the ancestors. As a special case, for the least common subsumer (LCS), Kamran and Naveed [65] calculated the semantic value or level-weight by considering only the level-weights of the ancestors on the longest path from the root to the LCS, which captures only the influence of the ancestors of the LCS. However, this does not identify the true level-weight of the LCS with respect to each of the descendant entities in the association. For computing the level-weight of the LCS, it is therefore necessary to consider the level-weights of the children of the LCS on the deeper or longest path that connects the LCS with each of its descendant entities in the association, as this reveals the actual semantic value of the LCS. Therefore, in this work, a novel ancestorial level-weight contributing to the association strength is derived; it is denoted by component B in Figure 5 and given in Eq (5) for an ancestor with respect to one disease of the pair.
where ∆ is the weight factor of the edge linking the ancestor with its child. The weight factor, which ranges from 0 to 1, reduces the contribution of ancestors that are distant from the disease; we found that ∆ = 0.4 gives better correlation with the standard DDA scores from DisGeNET. The level-weight of the ancestor with respect to the other disease of the pair is derived similarly. Finally, the two components derived in Section 4.2.2 are multiplied to obtain the final association representation for the disease pair. With this association vector, we further vary the association based on the ontological relationship connecting the two diseases, using an additional DDA quantification described in Section 4.2.3.
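Since Eq (5) is not reproduced in the text, the following is only a sketch of one plausible reading: a Wang-style recursive level-weight (the disease itself has weight 1, and an ancestor takes the maximum over its children of ∆ times the child's weight), with the children restricted to those in the new ancestors set. The toy graph, the `allowed` set and the function name are hypothetical.

```python
DELTA = 0.4  # edge weight factor reported in the text as giving better correlation


def level_weight(ancestor, disease, children, allowed):
    """Level-weight of an ancestor with respect to a disease (assumed form).

    Only children that are the disease itself or belong to `allowed`
    (the new ancestors set of the disease) may contribute.
    """
    if ancestor == disease:
        return 1.0
    weights = [
        DELTA * level_weight(child, disease, children, allowed)
        for child in children.get(ancestor, [])
        if child == disease or child in allowed
    ]
    return max(weights, default=0.0)


# Hypothetical toy ontology (parent -> list of children)
toy_children = {
    "root": ["A", "B"],
    "A": ["C", "D2"],
    "B": ["D1"],
    "C": ["D1"],
}
allowed_d1 = {"root", "A", "C"}  # assumed new ancestors set of D1; B is excluded

w_c = level_weight("C", "D1", toy_children, allowed_d1)        # one edge above D1
w_a = level_weight("A", "D1", toy_children, allowed_d1)        # two edges above D1
w_root = level_weight("root", "D1", toy_children, allowed_d1)  # three edges via A and C
```

If every child were allowed to contribute, the root would instead inherit its weight through B (0.4 × 0.4 = 0.16), a node that does not take part in the association; restricting the children yields 0.4³ = 0.064.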

Stage-2 Ontological relationship connection-based DDA quantification
For a given disease pair, Section 4.2.2 establishes the association through other diseases in the ontology using the ancestorial relationship, without considering any variation factor. However, the type of ontological relationship connecting the two diseases reveals the actual association. Hence, varying the association based on the type of relationship connection provides a finer adjustment to the already derived association vector. Therefore, in this work, we propose an ontological relationship variation factor (ORVF) for the second level of DDA quantification.
As a diagrammatic illustration, ORVF values for different types of ontological relationship connections are shown in Figure 6. In Figure 6(a), the ORVF is 0 when both diseases are immediate children of the same parent, each at a distance of 0.1, taking the edge weight as 0.1. Similarly, in Figure 6(b), the ORVF is 0 as one disease is the direct parent of the other, at a distance of 0.1. Thus, an ORVF of 0 represents no variation of association when the two diseases are very closely related, as in sibling and direct parent-child relationships. However, variation occurs when the two diseases are distantly related. For example, ORVF values are calculated for the indirect relationships shown in Figure 6(c), (d) and (e). In Figure 6(c), one disease acts as grandparent of the other, producing an ORVF of 0.2 as the descendant is at a distance of 0.2 from it. The uncle relationship in Figure 6(d) gives an ORVF of 0.3, an aggregation of the distances 0.1 and 0.2 of the two diseases from their LCS. In Figure 6(e), one disease acts as a cousin of the other, resulting in an ORVF of 0.4 as both diseases are at a distance of 0.2 from their LCS. Thus, the ORVF varies the extent of the DD association by each disease's independent distance from the LCS.
Algorithm 1 summarizes the procedure of adjusting the stage-1 association vector by the proposed ORVF. An illustration of the algorithm is given in Figure 7(a) and (b), showing the ORVF calculations for sibling and cousin ontological relationships connecting the two diseases, respectively.
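The ORVF values quoted for Figure 6 are consistent with a simple rule, sketched below under stated assumptions: each disease's distance from the LCS is its number of edges times the edge weight 0.1, the ORVF is 0 when neither disease is more than one edge from the LCS (sibling or direct parent-child), and otherwise it is the sum of the two distances. The function name and the edge-count interface are hypothetical; how the ORVF is then applied to the association vector is not modelled here.

```python
EDGE_WEIGHT = 0.1  # edge weight used in the Figure 6 illustration


def orvf(edges_a, edges_b):
    """ORVF from the number of edges separating each disease from the LCS.

    edges_a, edges_b: non-negative integers (0 means the disease is the LCS).
    """
    if max(edges_a, edges_b) <= 1:
        # Sibling or direct parent-child: no variation of the association
        return 0.0
    # Otherwise aggregate the two independent distances from the LCS
    return round((edges_a + edges_b) * EDGE_WEIGHT, 10)


sibling = orvf(1, 1)      # Figure 6(a)
parent = orvf(0, 1)       # Figure 6(b)
grandparent = orvf(0, 2)  # Figure 6(c)
uncle = orvf(1, 2)        # Figure 6(d)
cousin = orvf(2, 2)       # Figure 6(e)
```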
Figure 7(a) and (b) follow the same procedure to compute the ORVF. The first step finds the LCS of the two diseases by defining their paths using the new ancestors sets of each disease; the resulting LCS differs between Figure 7(a) and (b). The next step finds the distance of the LCS from each of the two diseases independently, using the level of the LCS and the level of each disease. These distances are both 0.1 for the sibling relationship in Figure 7(a), and 0.1 and 0.2 for the cousin relationship in Figure 7(b). Finally, with the calculated distances, the ORVF is computed for the direct/sibling relationship in Figure 7(a) and for the broader/indirect relationship in Figure 7(b). For the direct/sibling relationship, the association embedding is not varied since the ORVF is 0; hence, the stage-1 association embedding is also the final association embedding in Figure 7(a). In Figure 7(b), the association embedding is reduced by a factor of 0.3, the total distance of variation between the two diseases through the LCS, contributed by the distances 0.1 and 0.2 of the two diseases from the LCS. Using the stage-1 association embedding as shown in Figure 4 and the proposed ORVF, we are able to construct an enhanced concept-based DDA matrix of DDA representations that is later used for concept-based DDA strength.

Integration and Enhancement of final disease vector representation
Finally, an information-rich single disease vector of a disease in a given association can be obtained, as shown in Figure 8, by the following steps. First, literature-based disease vectors are extracted from the constructed literature-based positive, negative DD association matrix, and concept-based disease vectors from the concept-based DD association matrix, as discussed in Sections 4.1.2 and 4.2. The extracted vectors are then integrated into a single integrated disease vector. As an enhancement to the final DDA strength, the integrated single disease vector is enhanced with additional contextual information obtained from the literature-based null DD similarity matrix in Section 4.1.2, using a vector-similarity fusion method, in order to obtain the final DDA strength.

Extraction of single disease vector
For a given association, the literature-based single disease vector of the first disease with respect to the second is extracted from the association vectors in the literature-based positive, negative association matrix of Eq (1): the literature-based DDA vectors of the first disease's associations with the other diseases are averaged, and the averaged component is then concatenated with the association vector of the pair itself, as shown in Eq (6). It is important to preserve the actual information component of the pair through concatenation while representing the disease vector. The single disease vector for the second disease is extracted similarly. The same strategy is followed when extracting disease vectors from the concept-based DDA matrix, where the concept-based vector of the first disease with respect to the second is shown in Eq (7).

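where, in Eqs (6) and (7), the literature-based and concept-based association vectors of the disease pair are used, respectively. The unprimed result represents the literature-based single disease vector of the first disease with respect to the second; similarly, the primed result represents the concept-based single disease vector of the first disease with respect to the second.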
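The extraction step described above (average the disease's association vectors with the other diseases, then concatenate the pair's own association vector) can be sketched as follows. The dictionary layout, the exclusion of the pair itself from the average and the zero-vector fallback are assumptions for illustration, since Eqs (6) and (7) are not reproduced here.

```python
def mean_vector(vectors, dim):
    """Element-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def single_disease_vector(assoc, disease, partner, dim):
    """Averaged association vectors of `disease` with the other diseases,
    concatenated with the association vector of the (disease, partner) pair.

    assoc maps frozenset({a, b}) -> association vector of the pair a, b.
    """
    others = [
        vec for pair, vec in assoc.items()
        if disease in pair and partner not in pair
    ]
    pair_vec = assoc[frozenset({disease, partner})]
    return mean_vector(others, dim) + list(pair_vec)


# Hypothetical toy association matrix with 2-dimensional association vectors
toy_assoc = {
    frozenset({"D1", "D2"}): [1.0, 0.0],
    frozenset({"D1", "D3"}): [0.0, 2.0],
    frozenset({"D1", "D4"}): [2.0, 0.0],
    frozenset({"D2", "D3"}): [5.0, 5.0],
}

v_d1 = single_disease_vector(toy_assoc, "D1", "D2", dim=2)
v_d2 = single_disease_vector(toy_assoc, "D2", "D1", dim=2)
```

The same function could be applied to a concept-based association matrix to obtain the concept-based single disease vector.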

Integration of single disease vectors
For an information-enriched disease vector representation, the extracted literature-based and concept-based single disease vectors are integrated into a single information-rich disease vector. In earlier disease vector representations, only a narrow range of disease-disease linking relations was fused [32,33]. To achieve better association, in this work, the disease vector is represented by integrating vector representations of a wide range of disease-disease linking information from both literature-based and concept-based biomedical data sources.
Thus, for an information-enriched representation of the diseases in an association, the extracted literature-based and concept-based disease vector components in Eqs (6) and (7), respectively, are concatenated into a single integrated disease vector of the first disease with respect to the second, as in Eq (8).
where the left-hand side represents the single integrated disease vector of the first disease with respect to the second, and the two concatenated terms represent its literature-based and concept-based (primed) single disease vectors. The integrated vector of the second disease with respect to the first can be defined similarly using Eq (8).

Enhancement to the integrated disease vector
In addition, the information-enriched integrated disease vector is enhanced with additional contextual relationships with all other diseases, obtained from the literature-based DD null similarity matrix derived in Section 4.1.2. Manchanda and Anand [78] enhanced the disease vector representation by updating an initial vector representation, built using only literature (PubMed), with the corresponding similarity information with all other diseases. Enhancing such a less informative disease vector with similarity is needed to produce a proper enhanced disease vector. Hence, in this work, we use the information-enriched integrated disease vector derived in Eq (8) as the initial vector for the similarity update, using the vector-similarity fusion method defined in Eq (9), which uses an objective function [rep learning paper] in which the scalar component is replaced by the null similarity scores.
Thus, the enhanced integrated vector of the first disease with respect to the second is obtained by updating the integrated vector of Eq (8) so that the objective function in Eq (9) is minimized. In Eq (9), the terms are the integrated disease vector of the first disease with respect to the second, the literature-based null similarity scores between diseases, and the length of the vector. The enhanced integrated vector of the second disease with respect to the first is updated similarly by minimizing the objective function. Thus, a rich integrated and enhanced disease vector representation is derived that supports DDA both contextually and semantically, leading to a better quality of final DDA strength.

Relatedness calculation
Finally, with the enhanced-integrated disease vector representations obtained in Section 4.3.3, a cosine similarity is applied to obtain the final score measuring the actual strength of association for the given disease pair as shown in Eq (10).
where the two primed vectors represent the enhanced integrated disease vectors of the first disease with respect to the second and of the second with respect to the first, respectively. Therefore, in this section, instead of finding the embedding vector for a disease in isolation, we used a modified method similar to Manchanda and Anand [78], in which the disease embedding is discovered in relation to the DDA. We integrated literature-based and concept-based conceptual and semantic multi-source embeddings with richer ontological embeddings to discover DD associations and derive their strengths.
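The last two steps, concatenating the literature-based and concept-based single disease vectors (Eq (8)) and scoring the pair by cosine similarity (Eq (10)), can be sketched as below. The similarity-based enhancement of Eq (9) is deliberately left out because its objective function is not reproduced in the text; the function names and toy vectors are hypothetical.

```python
import math


def integrate(literature_vec, concept_vec):
    """Concatenate literature-based and concept-based single disease vectors."""
    return list(literature_vec) + list(concept_vec)


def dda_strength(vec_a, vec_b):
    """Cosine similarity between two disease vectors of equal length."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional literature and concept vectors for each disease
first = integrate([1.0, 0.0], [0.0, 1.0])   # first disease w.r.t. the second
second = integrate([1.0, 0.0], [1.0, 0.0])  # second disease w.r.t. the first
score = dda_strength(first, second)
```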

Results and discussion
To evaluate the enhanced DDA framework, we first evaluate the performance of the proposed association classification model ESEC-CNN with improved sentence representation, which, once trained, facilitated the construction of the enhanced DDAE dataset. The classification model was evaluated using Precision, Recall and F-measure. The correlation between the association scores obtained from the enhanced literature-based DDA representations and the association metrics of Wang et al. [24], Resnik [25], Schlicker et al. [88] and Lin [26] is evaluated using Spearman's rank correlation coefficient. Second, the enhanced concept-based DDA representations are evaluated on both the established biomedical dataset DisGeNET and human-rated DDA datasets using Spearman's rank correlation coefficient. Third, the single disease vector representation is evaluated using the literature-based and concept-based approaches independently and, in a similar manner, using the integration of both. Finally, the quantification of DDA pairs obtained using the enhanced single disease vector representation is compared with state-of-the-art methods and evaluated from different perspectives of DDA criteria. Additionally, we show the biological effect of the DDA scores derived from the integrated and enhanced disease vector representation for the most strongly associated disease pairs, category-wise.
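For readers unfamiliar with the measure, Spearman's rank correlation is the Pearson correlation of the ranks of the two score lists. A small dependency-free sketch follows; the score lists are made-up values, not results from this study.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equally long score lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four disease pairs
predicted = [0.9, 0.4, 0.7, 0.1]
reference = [0.8, 0.5, 0.3, 0.2]
rho = spearman(predicted, reference)
```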

Evaluation of improved sentence representation
We conducted experiments to show the effect of additional features in sentence representation on the classification performance of various sentence classification models (Table 1 and Figure 9). The DDA classification performance of the baseline models LSTM [49], BiLSTM [89], CNN [90], BERT [91], BioBERT [92] and LC-CNN [17], without improved sentence representation (limited local and global-level features) and with it (additional local and global-level features), is evaluated on the available annotated DDA dataset using 5-fold cross-validation. The implementation uses TensorFlow with a learning rate of 0.025, a batch size of 8, 5, 10 and 15 epochs, and a layer size of 352.
Among all classification models, the CNN-based models perform better at sentence classification than the sequence-based LSTM and BiLSTM models.
The LC-CNN model with the additional news embedding feature (global-level) shows only a small improvement in F-measure over LC-CNN with limited features. With the combined additional local-level embeddings of NDE, dependency relation and chunk tag, along with other global-level embeddings including news, the ESEC-CNN model (LC-CNN with improved sentence representation) outperformed the other baseline models, including LC-CNN without improved sentence representation, with an F-measure of 85.54%.
Notably, the other baseline models also achieve a better F-measure when the sentence representation is improved with additional local and global-level features. Hence, improved sentence representation has a major positive effect on the other models as well. The better performing ESEC-CNN model is then used for DDA dataset expansion, where the size of the labelled PubMed abstracts is increased starting from the initial 213 seed DD pairs obtained from a combined benchmark similarity dataset, as discussed in Section 4.1.2.
From PubTator, a set of abstracts is downloaded in BioCXML format from https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/PubTatorCentral_BioCXML/BioCXML.9.tar (accessed 12 July 2022), ensuring that only abstracts containing sentences with the given DD pairs are retrieved. At each iteration, a new unique set of DD pairs is produced from the retrieved set of abstracts. The number of newly produced DD pairs increases during the first few iterations, and a drop in the count of new DD pairs acts as the stopping criterion for the abstract retrieval process.
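The iterative expansion with its stopping criterion can be sketched as a simple loop. This is an illustration only: `retrieve_pairs` stands in for the (unspecified) step that retrieves abstracts for the current pairs and extracts the DD pairs they contain, and the toy lookup table replaces the PubTator data.

```python
def expand_pairs(seed_pairs, retrieve_pairs):
    """Grow the set of DD pairs until the number of new pairs drops.

    retrieve_pairs: hypothetical callable taking a set of pairs and returning
        the DD pairs found in abstracts that mention them.
    Returns the final set of pairs and the count of new pairs per iteration.
    """
    known = set(seed_pairs)
    frontier = set(seed_pairs)
    previous_new = 0
    history = []
    while frontier:
        new_pairs = set(retrieve_pairs(frontier)) - known
        history.append(len(new_pairs))
        known |= new_pairs
        if len(new_pairs) < previous_new:
            break  # the count of new pairs dropped: stop retrieving
        previous_new = len(new_pairs)
        frontier = new_pairs
    return known, history


# Toy stand-in for abstract retrieval (hypothetical co-occurrence table)
TOY_COOCCURRENCE = {
    ("a", "b"): [("a", "c"), ("b", "c")],
    ("a", "c"): [("c", "d"), ("c", "e"), ("a", "d")],
    ("b", "c"): [("b", "e")],
}


def toy_retrieve(pairs):
    found = set()
    for pair in pairs:
        found.update(TOY_COOCCURRENCE.get(pair, []))
    return found


all_pairs, new_per_iteration = expand_pairs({("a", "b")}, toy_retrieve)
```

In the toy run the number of new pairs rises from 2 to 4 and then falls to 0, which ends the loop.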
With the retrieved 39,510 abstracts, a total of 58,980 unique DD pairs are identified. For the construction of the enhanced DDA extraction (DDAE) dataset, the LC-CNN model with improved sentence representation is trained on the available labelled abstracts [17] and then applied to the created dataset. The trained model is able to identify a large number of positive, negative and null pairs from only a small number of seed pairs. A statistical comparison of the enhanced DDAE dataset with the available 521 labelled DDAE dataset [17] from which it starts is given in Table 2. DD pairs classified by the ESEC-CNN model are of 3 types, namely both positively and negatively associated, only negatively associated, and null associated; their association scores are validated as discussed earlier in this section, and the evaluation of the 3 types is shown in Tables 3-5, respectively. The association measures are calculated using the DOSim package [5]. Further, the concordance of the scores of the classified DD pairs with each of the association metrics is evaluated on both the 521 labelled DDA abstracts [17] and the constructed DDAE dataset.
For both positively and negatively associated DD pairs, as shown in Table 3, the DDA strengths derived by Sicknessminer [47] and GexText [18] are less correlated with all metrics when evaluated on both datasets, with counts of 54 and 3254 DD pairs. The lower correlation arises because Sicknessminer considered the number of co-mentions while ignoring the context, treating all co-mentioned pairs as contributing equally to the DD association, while GexText considered a direct positive association if the DD pair had an average occurrence in the whole corpus, thus missing the negative context of the pairs. Considering the negative context in association quantification therefore balances the real context in which disease pairs are associated, and can explain the significant correlation achieved by association scores computed using the literature-based positive, negative association matrix. For only negatively associated DD pairs, as shown in Table 4, a total of 70 and 2234 DD pairs were found in the available and enhanced DDAE datasets, respectively. Their scores derived from the literature-based DDA matrix are positively correlated, while the other literature-based scores are negatively correlated, indicating that the context in which DD pairs occur plays a crucial role, rather than only their occurrence frequency as in other methods.
Similarly, the associations discovered for 2649 (521 abstracts) and 36102 (enhanced dataset) null pairs from the literature-based null similarity matrix also correlate better than the other methods shown in Table 5. As only a few pairs co-occur, Sicknessminer [47], which uses co-mention analysis for association, is less correlated. GexText [18] produces strong associations for DD pairs with higher occurrence in the corpus, which may not in fact be strongly associated, and is hence less correlated than our null similarity scores, which consider the surrounding context influencing each disease in the given pair. On the other hand, GloVe [93] generated less informative embeddings for association calculation and is therefore less correlated in all the above cases.

Concept-based DDA evaluation
To characterize the concept-based DDA, the derived association embedding is evaluated component by component: the discovered new ancestors sets, the multi-source ancestorial embedding with root and leaf node, the novel ancestorial level-based DDA quantification and, finally, the proposed ontology-based joint multi-source association representation with the ontological relationship connections. Each is evaluated against the association scores from DisGeNET and the human-assessed combined dataset, as discussed in Section 3.4.

Evaluation of discovered ancestors sets for DDA quantification
Table 6 shows the correlation effect of varying combinations of ancestors sets for the two diseases in association quantification. For the 7936 DD pairs of ontology diseases found in common with DisGeNET, the embeddings derived with the discovered new ancestors sets are better correlated than the embeddings with all ancestors [86]. Considering only common ancestors, without the ancestors on the longest path to each disease, shows better correlation than all ancestors but is still less correlated than the new ancestors sets, since association is influenced not only by commonality but also by the ancestors on the longest path to each of the diseases.

Table 6. Comparison of the effect of the newly discovered ancestors sets with other ancestors sets of the two diseases for association quantification, using Spearman's rank correlation between association scores of DDA pairs obtained using different ancestors sets and DisGeNET DDA scores.

Evaluation of multi-source embeddings for DDA quantification
With the best correlated newly defined ancestors sets and with all ancestors, the concept embeddings are further evaluated to show the effect of multi-source embeddings of those ancestors, with and without the multi-source information of the root and leaf nodes. The concept embeddings are evaluated as shown in Table 7 for 2658 DD pairs. We observed that concept embeddings using ancestorial embeddings from multiple conceptual sources, including root node and leaf node multi-source embeddings, in addition to the new ancestors sets give significantly higher correlation than the baseline that considers only semantic sources [86].

Table 7. Comparison of the effect of multi-source embeddings of ancestors with/without multi-source embeddings of the root node and leaf node for association quantification, using Spearman's rank correlation between association scores of DDA pairs obtained using multi-source ancestorial embeddings and DisGeNET DDA scores.

To evaluate the effect of the level-weight or semantic value of the LCS of the two diseases in the association, we compared the level-weight of the LCS computed from the longest path of the lower DAG with respect to each disease separately, using the proposed novel level-weight, with that computed from the upper DAG using Baseline_LCA in GOntoSim [65], as shown in Table 8. The results show that DDA quantification by the level-weight of the LCS using the lower part of the DAG connecting the two diseases is better correlated with DisGeNET DDA scores than the level-weight of the LCS using the upper part of the DAG.
To demonstrate the effectiveness of adding the novel level-weight to the multi-source ancestorial embeddings, we first examine the effect of varying the level-weight calculation of ancestors, including the LCS, based on the selection of children, and then evaluate various combinations of level-weight with and without multi-source ancestorial embeddings. As shown in Table 9, with 175,939 DD pairs, the novel level-weight, where the level-weight is contributed only by children that belong to the newly defined ancestors sets, outperformed the baseline level-weight calculation [63] even without multi-source ancestorial embeddings. The correlation is better still when the novel level-weight is applied to multi-source ancestorial embeddings. To showcase the effect of the proposed ORVF, different combinations of the relationship connections discussed in Section 4.2.3 are considered. The effect of the various ontological relationships is then evaluated through DDA quantification on the DDA dataset, as shown in Table 10.
Further, the proposed ontology-based joint multi-source association representation is evaluated against state-of-the-art concept representation methods, to show the effect of varying the ontological relationship connection of the given disease pair when applied to the association embedding that combines all the better performing components identified in the earlier sub-experiments; the results are shown in Table 10 for 1756 DD pairs. The proposed model, which considers all sub-components (discovered new ancestors sets, multi-source ancestorial embedding with root and leaf node, novel ancestorial level-based DDA quantification and the ontological relationship connections), is more strongly correlated than the existing methods. The method of [86], which considered only semantic ancestorial embeddings without level-weight on all ancestors, is less correlated than the other methods that considered other contextual and semantic relations.

[86] 0.703
Cui2vec (Clinical Notes, Claims Insurance, Journal articles) [20] 0.772
Retrofitted concept vector representation (PubMed, UMLS) [33] 0

Evaluation of literature-based, concept-based and integrated approaches of disease representation for DDA quantification
The analysis presented so far shows the effectiveness of literature-based DDA and concept-based DDA. However, we also need to evaluate the integrated literature-based and concept-based DDA representation. This requires representing each disease as a single disease vector that integrates the literature-based and concept-based methods. The enhanced single vector representations of two diseases are then used to compute the DD association using cosine similarity. To show the effect of the integrated disease representation, the computed association scores are compared with other state-of-the-art methods that use only literature-based, only concept-based, or integrated literature-based and concept-based perspectives.
The disease representations produced by the models are evaluated on datasets reflecting different DDA criteria. We relied on the disease-related biological database DisGeNET, where two association criteria were used to derive DDA scores: disease-associated genes and disease-associated variants. The Jaccard index is used to compute these association scores. In addition, we created a standard dataset covering the functional aspects of DDA using GO functions. The disease-related GO terms are obtained from the Comparative Toxicogenomics Database (CTD), and the DDA score in the GO perspective is again calculated with the Jaccard index. Finally, we also evaluated against the human-rated DD pairs obtained from a benchmark dataset. Details of the datasets used are discussed in Section 3.4.
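The Jaccard index used for the gene-, variant- and GO-based reference scores is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the identifiers `g1` to `g5` are placeholders, not real disease-gene associations.

```python
def jaccard(items_a, items_b):
    """Jaccard index of two collections; 0.0 when both are empty."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder gene sets for two diseases
genes_disease_1 = {"g1", "g2", "g3"}
genes_disease_2 = {"g1", "g3", "g4", "g5"}
gene_based_score = jaccard(genes_disease_1, genes_disease_2)  # 2 shared out of 5 total
```

The same function applies unchanged to sets of variants or GO terms.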

Table 11. Comparison of different aspects of disease vector representations using Spearman's rank correlation between association scores of DDA pairs obtained across various association criteria using DisGeNET (Gene and Variants), the standard dataset (GO) and human-assessed scores.

Table 11 summarizes the correlation of the DDA scores obtained by different methods across the various datasets. The DDA scores derived using only the literature-based disease representation show better correlation than those of the other literature-based methods for the gene-based, GO-based and human-rated scores. The reason may be that considering the different context types in which DD pairs occur has a major influence on DDA scores, as the additional features used during sentence representation learning can lead to better classified contexts. In contrast, the correlation on the variant-based dataset is lower, possibly because the PubMed abstracts used contain few sentences with variant-related information, or only limited contexts, since we consider only sentences that mention diseases.
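For reference, the Spearman's rank correlation used in this comparison can be sketched in plain Python as the Pearson correlation of average ranks. The score lists below are invented for illustration and are not results from the paper; a statistics library would normally be used in practice.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four disease pairs, in the same pair order.
predicted = [0.91, 0.40, 0.75, 0.10]
reference = [0.80, 0.35, 0.50, 0.05]
rho = spearman(predicted, reference)  # both lists order the pairs identically
```

Only the ordering of the scores matters here, which is why the measure suits comparisons between scores on different scales, such as cosine similarities and Jaccard indices.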

The DDA scores derived using only the concept-based representation show better correlation on all datasets, although the improvement on the variant-based dataset is only slight. The proposed ontology-based method embeds narrower, more specific concept information from the ontology rather than generic concepts. This is achieved by controlling the contribution of ancestors to the DDA and by varying the effect of the different taxonomic relationships in the ontology. Moreover, we select ancestors with respect to the DDA rather than independently for each disease. All of these choices have a major positive effect on the DDA scores across the different aspects.
Among the integrated approaches, the proposed method outperforms the other baseline methods on all datasets. Integrating the enhanced literature-based contextual relations with the enriched semantic relationships gives a broader coverage of relationships that may capture various influential factors affecting DDA. This includes indirect relationship information that can jointly eliminate false positives. Hence, the proposed work shows promising results across the different aspects of DDA.

Implementation
The machine configuration includes an Intel(R) Xeon(R) 3.60 GHz (CPU), a 64-bit OS (system) and 64 GB RAM (memory). Our system uses Python to implement the models. For the literature-based DDA classification discussed in Section 5.1.1, Table 12 shows the time taken by the baseline models and the proposed model for the training and prediction tasks. We found that the CNN models take less training time than the other models since they involve fewer parameters to compute. The LC-CNN and the proposed ESEC-CNN models take almost equal time, since ESEC-CNN only adds features to the input sentence representation.
For the concept-based DDA representation discussed in Section 5.2.4, the proposed ontology-based joint multi-source embedding representation takes 22 seconds on average to derive a DDA representation, which is higher than for the other models. This cost arises from calculating the different kinds of ancestor information discussed in earlier sections, such as level weights and attention weights, as well as the various ontological relationships, to generate the final DDA representation. The other concept-based baseline models, Cui2vec [20] and the retrofitted concept vector representation [33], take less time than MMORE [86] and the proposed model, as they do not consider the deeper ancestor information and ontological relationships. Compared with MMORE, the proposed model takes much more time since it additionally computes the ancestorial level weights and the effect of the ontological relationships. Although the proposed model takes longer to obtain a DDA representation, it produces a quality embedding whose effectiveness is supported by the correlation results in Table 10.

Biological analysis
The significance of the DDA scores obtained by the proposed framework is analysed from a biological perspective by listing the top 20 associated disease-disease pairs with normalized scores in Table 13, the disease-wise most associated diseases in Table 14, the top 5 category-wise associations in Table 15 and the top 10 associated diseases with their corresponding categories for a given disease in Table 16.
For a given disease, Table 14 shows the diseases most strongly associated with it. The performance of the disease representation in DDA quantification is further validated by disease categories, where the diseases are classified according to the top 14-level DO categories such as "disease of cellular proliferation", "nervous system disease", "cardiovascular system disease", "musculoskeletal system disease", "endocrine system disease" and so on [72]. The strength of association between two disease categories is measured by averaging the normalized DDA scores between the diseases of those categories. The disease category pairs are then ranked by this normalized score.
We find that diseases within the same category have a higher average association score than diseases from different categories, as shown in Table 15. Diseases in the "nervous system disease" category have relatively high association scores with all other disease categories. On the other hand, diseases in "disease by infectious agent", "endocrine system disease" and "urinary system disease" have lower average association scores with all other categories than with diseases in their own category. For the "nervous system disease" category, the average score is comparatively high both within the category and with "cardiovascular system disease" and "musculoskeletal system disease". In contrast, the average association scores of diseases in "disease of cellular proliferation" are far lower with diseases in "endocrine system disease" and "cardiovascular system disease" than with other categories. In addition, we show the category-wise top 10 associated diseases for "Diabetes mellitus" of "endocrine system disease" and "cardiovascular disease" of "cardiovascular system disease" in Table 16.
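The category-level averaging and ranking described above can be sketched as follows. The min-max normalization is an assumption of this sketch, since the normalization scheme is not restated in this section, and the disease identifiers, shortened category labels and scores are placeholders.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Assumed normalization: rescale pair scores linearly to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pair: 0.0 for pair in scores}
    return {pair: (s - lo) / (hi - lo) for pair, s in scores.items()}


def category_pair_averages(scores, category_of):
    """Average the pair scores for each unordered pair of categories."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (d1, d2), s in scores.items():
        key = tuple(sorted((category_of[d1], category_of[d2])))
        totals[key] += s
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Placeholder diseases, categories and raw DDA scores.
category_of = {"d1": "nervous", "d2": "nervous", "d3": "endocrine"}
raw = {("d1", "d2"): 0.9, ("d1", "d3"): 0.3, ("d2", "d3"): 0.5}

normalized = min_max_normalize(raw)
averages = category_pair_averages(normalized, category_of)
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

Keys with the same category twice, such as ("nervous", "nervous"), correspond to intra-category averages, so intra- and inter-category associations can be ranked together.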

Conclusions
Richer disease vectors for the qualitative and quantitative measurement of DDA strength provide valuable information to clinicians for better healthcare planning. Existing methods of integrated vector representation do not consider the various sentence contexts from the literature, nor the semantic embedding of concepts together with the different ontological relationship connections from the ontology, for better quantification of biomedical associations. To address this issue, we presented an enhanced and integrated DDA framework that incorporates various types of sentence contexts (positive, negative and null) from the literature, together with semantically embedded concepts and the ontological relationship connections that affect associations, to obtain a richer disease vector representation. When evaluated on datasets covering different aspects of DDA, the enriched disease vectors achieved well-correlated DDA scores, especially on the gene-based dataset, compared with the baseline literature-based, concept-based and integrated representations. Moreover, we have also shown the top associated disease pairs and category pairs. Any quantification of biomedical associations that relies on representations of biomedical entities could benefit from the richer vector representation produced by the enhanced and integrated framework. In future, the integrated representation can also be used to determine the strength of other biomedical associations such as disease-gene, gene-gene and disease-symptom associations.

Figure 1. The proposed framework for calculating DDA.

Figure 3. Literature-based matrices with association vector and similarity score.

In stage 1, we include the association effect of the influential factors by infusing multi-source semantic information (DO, UMLS) and contextual information (clinical notes, insurance claims, journal articles) of the ancestors, including the root ancestor node and the leaf node. In addition, we add a novel level-weight to the multi-source ancestorial representation, where the level-weight is based on the new ancestor sets discovered initially, thus producing an association embedding matrix. In stage 2, we introduce ontological relationship connection-based DDA quantification, which varies the embedded association strength between diseases according to the type of their relationship connection in the ontology, resulting in a concept-based association matrix of DDA representations. Thus, in this work, we try to improve concept-based DDA by constructing a concept-based matrix of DDA representations using the ontology-based joint multi-source association embedding model, as shown in Figure 4.

Figure 4. Pipeline of concept-based DDA using the proposed ontology-based joint multi-source association representation.

Figure 6. ORVF calculation for different types of ontological relationship connections between a disease pair.

Figure 7. Adjusting the association vector by the proposed ORVF for a sibling relationship (left) and a cousin relationship connection (right).

Figure 8. Integration and enhancement of final disease vector representation.

Figure 9. DDA classification performance of baseline models without improved sentence representation and the proposed ESEC-CNN model with improved sentence representation.

Table 8. Comparison of the effect of upper and lower DAG-based level-weight (semantic value) computation of the LCS of a disease pair in DDA quantification, using Spearman's rank correlation between the association scores of DDA pairs obtained by varying the level-weight of the LCS and the DDA scores from DisGeNET.

Calculation of level-weight of the LCS of a disease pair | Spearman's rank correlation, N = 1,75,939 DD pairs (DisGeNET)
Baseline_LCA of GOntoSim [65], using upper DAG: level-weight of the LCS by ancestors on the longest path to the LCS | 0.773
Novel ancestorial-level weight, using lower DAG: level-weight of the LCS by children on the longest path to each disease of the pair | 0.782

.781
Proposed ontology-based joint multi-source association representation
Ancestorial level-based + ontological relationship connection-based: Parent, Grandparents only* | 0.787
Ancestorial level-based + ontological relationship connection-based: Parent, Grandparents & sibling only** | 0.790
Ancestorial level-based + ontological relationship connection-based: Parent, Grandparents, sibling, uncle & cousin relationships*** | 0.802

The attention weights on the multi-source embeddings are computed separately for each source. The weight for the text definition embedding from DO of an ancestor in the new ancestor set is computed by the SoftMax function in Eq (4.1). The scalar score function defined in Eq (4.2) measures the compatibility between the DO text embedding of the ancestor and the multi-source ancestorial embeddings, and is computed by a single-layer feed-forward neural network using Eq (4.3). The weights and bias are the learnable parameters of the neural network. The attention weights of the ancestor from the other sources are calculated in a similar manner, and similar equations are adopted for the ancestors of the other disease in the pair.

B) Novel ancestorial level-weight

Algorithm 1. Association vector adjustment by ORVF

Table 1. Performance of improved sentence representation with different classification models.

Table 2. Statistics of the available and constructed DDAE dataset.

Table 3. Spearman's rank correlation between enhanced literature-based positive, negative DD association matrix and DO-based similarity metrics (Wang, Resnik, Relevance, Lin) for both positively and negatively associated DD pairs from different sets of labelled DDA dataset.

Table 4. Spearman's rank correlation between enhanced literature-based positive, negative DD association matrix with DDA representation and DO-based similarity metrics (Wang, Resnik, Relevance, Lin) for only negatively associated DD pairs from different sets of labelled DDA dataset.

Table 5. Spearman's rank correlation between literature-based null similarity DD matrix and DO-based similarity metrics (Wang, Resnik, Relevance, Lin) for null associated DD pairs from different sets of labelled DDA datasets.

Table 9. Comparison of the effect of novel ancestorial level-based to that of existing ancestorial level-based DDA quantification using Spearman's rank correlation between association scores of DDA pairs obtained using level-weights of ancestors with and without ancestorial embeddings and DDA scores from DisGeNET.

Table 10. Comparison of ontology-based joint multi-source association representation and the existing concept-based representation methods for DDA quantification using Spearman's rank correlation between association scores of DDA pairs obtained using different concept-based representation methods and DDA scores from DisGeNET.

Table 12. Comparison of computation time with baseline models.

Table 13. Top 20 associated disease pairs ranked by normalized DDA scores.

Table 14. Disease-wise most associated diseases.

Table 15. Top 5 associated category pairs ranked by average of normalized DDA scores between intra and inter disease categories.

Table 16. Top 10 associated diseases category-wise ranked by normalized DDA scores.