KAE: A Property-based Method for Knowledge Graph Alignment and Extension

A common solution to the semantic heterogeneity problem is to perform knowledge graph (KG) extension, exploiting the information encoded in one or more candidate KGs, where the alignment between the reference KG and the candidate KGs is the critical procedure. However, existing KG alignment methods mainly rely on entity type (etype) label matching as a prerequisite, which performs poorly in practice or is not applicable in some cases. In this paper, we design a machine learning-based framework for KG extension, including a novel property-based alignment approach that aligns etypes on the basis of the properties used to define them. The main intuition is that it is properties that intentionally define the etype, and that this definition is independent of the specific label used to name an etype and of the specific hierarchical schema of KGs. Compared with the state-of-the-art, the experimental results show the validity of the KG alignment approach and the superiority of the proposed KG extension framework, both quantitatively and qualitatively.


Introduction
The semantic heterogeneity problem arises whenever there is a need to exploit knowledge graphs (KGs) from heterogeneous resources [36]. Here we focus on KGs where nodes are entities decorated by data properties and relations are object properties. Furthermore, we use entity type (etype) to mean the class to which an entity belongs, e.g., Person and Event. A solution to semantic heterogeneity is KG extension, which extends the reference KG with knowledge encoded in one or more candidate KGs. KG alignment, in turn, is an active topic that involves various tasks, e.g., the well-known ontology matching (OM), and two of its practical tasks are critical steps toward KG extension. The first is the alignment between etypes from reference and candidate KGs [2,21,32], which exploits only schema-level information, as is most often the case in OM. The second is the alignment between reference etypes and candidate entities, which exploits additional information about entities and is also known as etype recognition [43,47]. These two kinds of alignment are prerequisites for extending the reference KG at the schema level and the instance level, respectively.
Etype alignment is a prerequisite task in KG extension, since etypes define the schema of the entities collected in a KG. The reference KG can directly integrate candidate entities if their etypes align with reference etypes. Existing etype alignment methods mainly exploit natural language processing (NLP)-based [14,52] and structure-based [24,35] techniques. Both enforce etype label matching as a prerequisite. NLP-based methods use diverse lexical similarity metrics and synonym analysis to align etype labels, which is limiting in practice. Labels may suggest a wrong etype [49], since the same concept can be labeled differently by different KGs. For instance, an eagle can be labeled as Bird in a general-purpose KG and Eagle in a domain-specific KG. Conversely, the same label may denote different concepts in heterogeneous KGs, which also leads to wrong recognition results. Structure-based methods take the KG hierarchy as an additional input, where the structure of the hierarchy drives the label matching, e.g., by suggesting label matching on sub-classes or super-classes. However, these methods may also lead to wrong conclusions, as properties assigned to an etype in the hierarchy are cumulative and depend only on the nodes in the path from the root and, therefore, do not depend on the order in which they are assigned [30]. In addition, differences in taxonomy between KGs increase the impact of such mistakes, e.g., the super-class of etype Eagle can be Animal in one KG and Bird in another. On the other hand, the alignments between etypes and entities (which we call the etype recognition task) are also critical, since the reference KG can also be extended by candidate entities that are not covered by aligned etypes. However, such alignments are hard to identify because lexical labels are not applicable to aligning entities with etypes. For instance, the entity apple can be a company, a fruit, or the name of a pet, but there is only trivial lexical similarity between the entity apple and the etypes company, fruit and pet. Thus, current NLP-based and structure-based methods also perform poorly on the etype recognition task.
As a solution to the above problems, the main intuition of this paper is to align etypes/entities with reference etypes for KG extension on the basis of the properties used to define them. It is, in fact, the properties that are used to intentionally define an etype, and this definition is independent of the specific label and of the etype's position in the hierarchy [27]. This allows us to think of etypes as being organized in hierarchies, where lower etypes inherit properties from upper etypes and where the entities populating an etype also populate all the upper etypes. In turn, it allows us to think of KG alignment as a problem of matching inheritance hierarchies, where etypes may or may not be populated with entities. Figure 1 (a) provides one such hierarchy, used as a running example in the rest of the paper. In practice, most relevant KGs are associated with large numbers of properties, see, e.g., DBpedia [4] and OpenCyc [23]. The reason is straightforward: one of the purposes of KG extension is precisely to extend the number of properties.
In this paper, we propose a machine learning (ML)-based framework to realize the KG extension task, including an algorithm for generic KG alignment based on the above intuition of exploiting properties. We introduce a formalization strategy, where we organize a KG and its inner mappings between etypes/entities and properties based on formal concept analysis (FCA) [27]. Then, we present three property-based metrics to measure the similarity between etypes and entities, where the metrics characterize, from different aspects, the role that properties have in the definition of given etypes. They capture the main idea that the number of aligned properties affects the contextual similarity between etypes and entities. The proposed similarity metrics and algorithm are applied in two key modules of the KG extension framework, i.e., the etype alignment and KG extension modules. Finally, we obtain an extended reference KG by integrating etypes and entities from the candidate KGs.
Overall, the main contributions of this paper are as follows:
- We design an ML-based framework for KG alignment and extension tasks, where etype alignment and etype recognition are introduced as two key procedures.
- We propose a novel set of property-based metrics for measuring contextual similarity between KGs, while introducing an FCA-based KG formalization strategy.
- We implement a classification-based method for KG alignment, which exploits the property-based information through our proposed metrics.
- We compare our method with state-of-the-art etype recognition methods from several different perspectives. The experimental results show the validity of the similarity metrics and the superiority of the proposed KG extension framework, both quantitatively and qualitatively.
The rest of the paper is organized as follows. Section 2 discusses the motivation for exploiting properties for etype recognition, and Section 3 demonstrates how we formalize a KG and its inner mappings into FCA contexts. Section 4 introduces three specificity measurements and their corresponding property-based etype similarity metrics. In Section 5, we describe the proposed ML-based KG extension framework and the details of individual modules. We then discuss the experimental setup and results in Section 6. Finally, we present the related work in Section 7 and conclude the paper in Section 8.

Task Description
In the knowledge integration area, KG alignment and extension can be organized as sequential tasks, where KG alignment aims to align the knowledge from the candidate KG and the reference KG, and KG extension aims to integrate target information into the reference KG following the aligned knowledge. To clearly define these tasks, we define a KG as a hierarchy of concepts, where properties are used to describe them. Specifically, we define the schema of a KG and its inner relations as KGS = ⟨C, P, R⟩, where C = {C_1, ..., C_n} is the set of classes of entities (i.e., etypes), P = {p_1, ..., p_m} is the set of properties, R = {⟨C_i, T(C_i)⟩ | C_i ∈ C} is the set of correspondences between etypes and properties, and the function T(C_i) returns the properties associated with C_i. As for the entities, we define I = {I_1, ..., I_l} as the set of entities in the KG, where each entity I_i can be identified by one or several etypes, and t(I_i) refers to the set of properties associated with entity I_i. We consider that a property p is used to describe an etype C_i or an entity I_i when the property belongs to the set T(C_i) or t(I_i), respectively. Thus, given a reference KG_ref and a candidate KG_cand, considering the two cases in KG alignment, etype alignment task aims to align the candidate etypes

Observations and Motivations
Property is one of the most basic and critical elements for intentionally defining KG concepts, and it is independent of the specific label used to name concepts and of the specific hierarchical schema of KGs [31]. For each KG schema, etypes play the role of categorization, and properties aim to draw sharp lines so that each entity in the domain falls determinately either in or out of each etype [29]. Meanwhile, we also have the following observations when comparing properties across different KGs:
- In a specific KG, each etype is described by a set of properties, where most of the properties are distinctive of the etypes they belong to and a small number of properties are shared across different etypes.
- The same or similar properties are shared across different KGs for describing the same concepts.
To clearly present these observations, we introduce a special type of Venn diagram, namely knowledge lotuses, to represent the shareability of properties that occurs within and across KGs. Knowledge lotuses provide a synthetic view of how different KGs overlap in properties [30]. Figure 1 shows several examples, where we assume that we have four contexts built from (parts of) the four biggest KGs, namely OpenCyc, DBpedia, Schema.org, and FreeBase [9]. Each value in a lotus refers to the number of shared properties. For instance, in Figure 1 (b), we notice that the two etypes Person and Organization share four properties in DBpedia. In Figure 1 (c), we can see that All these examples present features of properties, namely, the unity for describing the same concept and the diversity for distinguishing different concepts. For instance, a Person can be distinguished from a Place by the property birth, which is a crucial step in aligning etypes and identifying the type of entities. Thus, inspired by these observations, we find that it is important to exploit and measure properties for better performance in KG alignment and extension tasks.

Knowledge Graph Formalization
We formalize the relation between properties and KG concepts as associations to introduce the property information into the target KG tasks. At the schema level, the KG schema is flattened into a set of triples, where each triple encodes information about etype-property associations, e.g., the triple "organization-domain-LocatedIn" encodes the "organization-LocatedIn" association. Instance-level cases generally define triples as "entity-property-entity", where two associations are encoded. For instance, the instance-level triple "EiffelTower-LocatedIn-Paris" encodes the two entity-property associations "EiffelTower-LocatedIn" and "LocatedIn-Paris". We introduce FCA [27] to encode such associations as a formalization of the KGs we process. Notice that both the schema-level and instance-level associations are included in the formalization. Specifically, we have two working assumptions:
- We consider both etypes C and entities I. Similar to general formalization methods [18,51], we associate an entity with its set of properties t(I_i). Different from general methods, we also associate an etype with its set of properties T(C_i).
- Etype characterization exploits not only the properties associated with the etype but also the properties associated with other concepts. Thus, we introduce the notion of unassociated properties and exploit this distinction in the formalization process.
As an example, the hierarchy of the KG schema in Figure 1 is extracted from DBpedia [4]. In each box, etypes are presented in yellow and their properties in green. The arrow refers to the sub-class-of relation between two etypes. We then formalize these etypes into an FCA context, as shown in Figure 2. We adopt the following conventions. A value box with "+1" represents the fact that the property is associated with the etype, e.g., citizenship is associated with Person. A value box with "-1" means the property is unassociated with the etype, e.g., date is not used to describe etype Person. The value "0" (for undefined) represents the fact that the property is unassociated with the etype but associated with one of its sub-classes. The intuition is that the property might or might not be used to describe the current etype, e.g., academy award is used to describe Artist and it might be used to describe Person since Artist is a subclass of Person.

Fig. 2: An example of formalizing KG into FCA contexts
Similar to etypes, we can also find formalized entities and their properties in Figure 2. Note that these entities are selected from a KG with a hierarchical schema; thus, they can inherit the unassociated properties from their etypes, e.g., as an Athlete, Usain Bolt does not have the property duration. We encode the above-mentioned three correlations as the parameter w_E(p). Since the correlation "associated with" is positive for a property-based description, the correlation "unassociated with" is negative, and the correlation "undefined" is neutral, we define the parameter as:

$$ w_E(p) = \begin{cases} 1, & p \in prop(E) \\ 0, & p \notin prop(E) \text{ and } p \in prop(E.subclass) \\ -1, & \text{otherwise} \end{cases} $$

where K denotes the union of etypes C and entities I in a KG (for a general presentation), E ∈ K is an etype/entity, p is the target property, E.subclass refers to the sub-classes of the etype E, and prop(E) refers to the properties associated with E. Note the special case in which undefined properties also arise when a specific entity misses an inherited property, e.g., race track is used to describe Sports events but is missing for its entity Snooker Championship, which also makes w_E(p) = 0. In this case, we encode properties for each etype and then record the properties associated with each entity. If an entity does not contain a property it should inherit from its etype, the property is considered undefined. With this KG formalization method, we distinguish several concrete situations for learning the association between properties and etypes/entities and for encoding more property information. The formalization serves as a pre-processing step for the property-based similarity computations in the following sections.
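The three-valued encoding above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `association_weight`, the dictionary layout, and the toy data are hypothetical, and "sub-classes" is assumed to mean all descendants of an etype rather than only its direct children. The entity-level special case (an entity missing an inherited property) is not covered.

```python
from typing import Dict, Set

def association_weight(etype: str, prop: str,
                       props: Dict[str, Set[str]],
                       subclasses: Dict[str, Set[str]]) -> int:
    """Return +1 (associated), 0 (undefined) or -1 (unassociated)."""
    if prop in props.get(etype, set()):
        return 1
    # Assumption: "sub-classes" covers all descendants, so we walk the hierarchy.
    stack = list(subclasses.get(etype, set()))
    seen: Set[str] = set()
    while stack:
        sub = stack.pop()
        if sub in seen:
            continue
        seen.add(sub)
        if prop in props.get(sub, set()):
            return 0
        stack.extend(subclasses.get(sub, set()))
    return -1

# Toy fragment loosely following the running example (hypothetical data).
props = {
    "Person": {"name", "citizenship"},
    "Artist": {"name", "academy award"},
    "Event": {"date"},
}
subclasses = {"Person": {"Artist"}}

# One row of the FCA context, for the etype Person.
context_row = {p: association_weight("Person", p, props, subclasses)
               for p in ("citizenship", "academy award", "date")}
print(context_row)
```

For Person, citizenship is encoded as associated, academy award as undefined (it is declared on the sub-class Artist), and date as unassociated, matching the conventions described for Figure 2.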

Property-based Similarity Metrics
One of the intuitions of our work is to identify etypes and entities by properties, which are essential elements for defining KG concepts [30]. The reference etype tends to match the candidate etype/entity when their properties overlap. Therefore, it is critical to measure the overlapping properties between etypes, since etypes with completely overlapping properties are rare in real-world KGs [26]. Meanwhile, we also exploit the intuitions underlying the normalization of the "get-specific" heuristic provided in [33] to distinguish the weights of different overlapping properties. The key inspiration is that properties at different levels of specificity have different relevance in etype recognition. In particular, a more specific property provides more information for defining concepts. As a result, in this section, we introduce three notions, namely horizontal specificity, vertical specificity, and informational specificity, together with their corresponding similarity metrics for measuring the degree of property overlap.

Horizontal Specificity
For measuring the specificity of a property, a possible idea is to horizontally compare the number of etypes that are described by a specific property, namely the shareability of the property [30]. If a property is used to describe diverse etypes, the property does not strongly characterize its associated etypes. Thus, for instance, in Figure 1, the property name is used to describe Person, Place, Athlete and Artist, so name is a common property that appears in different contexts. Dually, settlement is a horizontally highly specific property, since it is associated only with the etype Place. Based on this intuition, we consider the specificity of a property as related to its shareability. Therefore, we propose HS (Horizontal Specificity) for measuring property specificity. More precisely, HS aims to measure the number of etypes that are associated with the target property in a specific KG. We model HS as: where: p is the input property, K_p is the set of etypes described by the input property in a specific KG, and |K_p| is the number of etypes in K_p, thus |K_p| ≥ 1; e denotes the natural mathematical constant [25]; λ represents a constraint factor with range (0, 1]. We use an exponential function to model HS because we aim to normalize the horizontal specificity: different properties may differ widely in |K_p| in large KGs. Note that the range of HS is [−1, 1], and both extremum points indicate that the property is shared by one or a few etypes, signifying that the property is highly specific. The impact of such specificity is determined by the target etypes used for description, which is captured by λ, resulting in different outcomes.
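To make the behaviour of HS concrete, the following sketch uses one functional form that is consistent with the description above (exponential in |K_p|, constraint factor λ in (0, 1], range [−1, 1] once signed by w_E(p)). The exact expression is an assumption made for illustration, as are the function name and the default λ = 0.5; it should not be read as the paper's definition.

```python
import math

def horizontal_specificity(num_etypes: int, weight: int = 1, lam: float = 0.5) -> float:
    """Assumed form: weight * e^(lam * (1 - |K_p|)), with weight in {-1, 0, +1}."""
    if num_etypes < 1:
        raise ValueError("a property must describe at least one etype")
    if not 0.0 < lam <= 1.0:
        raise ValueError("lambda must lie in (0, 1]")
    return weight * math.exp(lam * (1 - num_etypes))

# settlement describes only Place; name describes four etypes in the running example.
hs_settlement = horizontal_specificity(1)
hs_name = horizontal_specificity(4)
print(hs_settlement, hs_name)
```

Under this assumed form, a property attached to a single etype reaches the extremum, and the value decays toward zero as the property is shared by more etypes; a larger λ makes the decay faster.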

Vertical Specificity
Etypes are organized into classification hierarchies such that the upper-layer etypes represent more abstract or more general concepts, whereas the lower-layer etypes represent more concrete or more specific concepts [33,45]. Correspondingly, properties of upper-layer etypes are more general, since they are used to describe general concepts, whereas properties of lower-layer etypes are more specific, since they are used to describe specific concepts. We assume that lower-layer properties contribute more to the identification of an etype since they are more specific. For instance, in Figure 1, as a lower-layer etype, Artist can be identified by the property academy award but not by the property name. Based on this intuition, we propose VS for capturing the vertical specificity, as follows: where layer(E) refers to the calculation based on the layer of the inheritance hierarchy where an etype E is defined, under a min-max normalization. Note that layer(E) ∈ [0, 1] and all E in the set K_p are described by p. The range of VS is [−1, 1]: the extrema VS = 1 and VS = −1 indicate that the property is hierarchically specific, with opposite impacts on the target etype, and VS = 0 indicates that the property is not specific enough from the hierarchical point of view.
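A small sketch may help to see how a layer-based score separates general from specific properties. The min-max normalization of layers follows the text; the aggregation over K_p (here the minimum, i.e., the most general etype that carries the property) and the multiplication by the sign w_E(p) are assumptions made for illustration, and all names and depths are hypothetical.

```python
from typing import Dict, Set

def normalized_layers(depth: Dict[str, int]) -> Dict[str, float]:
    """Min-max normalize hierarchy depths so that layer(E) lies in [0, 1]."""
    lo, hi = min(depth.values()), max(depth.values())
    if hi == lo:
        return {e: 0.0 for e in depth}
    return {e: (d - lo) / (hi - lo) for e, d in depth.items()}

def vertical_specificity(described: Set[str], layer: Dict[str, float],
                         weight: int = 1) -> float:
    """Assumed aggregation: the shallowest etype described by the property."""
    return weight * min(layer[e] for e in described)

# Hypothetical depths for a fragment of the running example.
depth = {"Thing": 0, "Person": 1, "Place": 1, "Artist": 2, "Athlete": 2}
layer = normalized_layers(depth)

vs_name = vertical_specificity({"Person", "Place", "Athlete", "Artist"}, layer)
vs_award = vertical_specificity({"Artist"}, layer)
print(vs_name, vs_award)
```

With these toy depths, academy award, which appears only on the lower-layer etype Artist, scores higher than name, which already appears on upper-layer etypes.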

Informational Specificity
Horizontal specificity allows measuring the shareability of properties, but it is independent of, and does not change (increase/decrease) with, the number of entities populating the etypes. We take this fact into account by introducing the notion of informational specificity IS. The intuition is that IS decreases when the number of entities increases. Thus, for instance, the IS of gold medalist decreases as the number of athlete entities increases, as in the schema in Figure 2. Clearly, IS, unlike HS, can be used in the presence of entities. The definition of informational specificity is inspired by Kullback-Leibler divergence theory [54], which is introduced to measure the difference between two sample distributions Y and Ŷ. More specifically, given a known sample distribution Y, assume that a newly arriving attribute x changes Y to Ŷ. Then, the Kullback-Leibler divergence theory states that the importance of x for defining Y is positively related to the difference between Y and Ŷ. In the definition of informational specificity, we need some notions from information theory, where we apply the informational entropy H(K) as: where K_v refers to any subset of K in a KG and |K_v| is the number of etypes included; H(•) represents the informational entropy of an etype set; E_i is a specific etype in the set K_v; F(E_i) refers to the number of samples of etype E_i, and F(K_v) refers to the number of all samples in K_v. Note that when we calculate the informational entropy for KGs without entities, F(E_i) = 1, since each KG includes one sample per etype. For KGs with entities, F(E_i) and F(K_v) depend on the number of samples of the given etype and of the subset of the KG. After obtaining the informational entropy, the informational specificity IS of a property is defined as: where K is the set of all etypes in the KG, and K̄ is the set of etypes not associated with the input property p, thus K̄ = K − K_p. Computed as a difference with respect to the overall informational entropy H(K), IS represents the importance of the property p for describing the given etype set K. For computational convenience, we employ min-max normalization to constrain the results to the range [−1, 1].
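The entropy ingredients of IS can be sketched as follows. The sketch assumes the standard Shannon form, H(K_v) = −Σ_i (F(E_i)/F(K_v)) · log(F(E_i)/F(K_v)), with the natural logarithm; both the form and the log base are assumptions, as is the helper name `entropy`. Only H(K) and H(K̄) are computed here; how they are combined, signed, and normalized into IS is left to the definition in the text.

```python
import math
from typing import Dict, Iterable

def entropy(counts: Dict[str, int], subset: Iterable[str]) -> float:
    """Shannon entropy of the sample distribution over the etypes in `subset`."""
    subset = list(subset)
    total = sum(counts[e] for e in subset)
    if total == 0:
        return 0.0
    h = 0.0
    for e in subset:
        share = counts[e] / total
        if share > 0:
            h -= share * math.log(share)
    return h

# Schema-only case: every etype contributes one sample, i.e. F(E_i) = 1.
schema_counts = {"Person": 1, "Place": 1, "Athlete": 1, "Artist": 1}
h_all = entropy(schema_counts, schema_counts)

# A property such as gold medalist is associated only with Athlete,
# so the complement set drops that etype.
h_rest = entropy(schema_counts, [e for e in schema_counts if e != "Athlete"])
print(h_all, h_rest)
```

When entities are present, the same function can be called with per-etype entity counts, so that the entropies (and hence IS) change as etypes become more or less populated.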

Similarity Metrics
We have modeled the specificity of properties, which represents their weight for describing KGs from different measuring aspects. Next, we define three similarity metrics based on the corresponding specificity to measure the property overlap between two concepts. Given two KGs, the reference KG A and the candidate KG B, we define a function for calculating different similarities between etypes/entities from A and B based on their corresponding specificity: where we take Notice that each similarity metric is symmetric, more specifically, Sim(E_a, E_b) = Sim(E_b, E_a). Note also that we apply z-score normalization [42] to all similarity metrics during calculation to constrain the range of Sim_H, Sim_V, and Sim_I to [0, 1] for computational convenience. The normalization follows the function z = (x − µ)/σ, where z is the normalized value, x is the original value, µ is the mean of all values, and σ is the standard deviation of all values.

Different from the KG completion task [57,60], which manipulates entities in a single KG, two (or more) KGs are involved in the KG extension task, where we have KG_ref and KG_cand as inputs. The KG parser parses each input KG as a structured set of etypes, entities, and properties. The KG is flattened into a set of triples, from which associations are extracted. The triples and associations are then used by the KG formalization module to generate an FCA context for each input KG. After all properties of the input KGs have been collected, they are sent to the NLP-based property matcher. Different property labels may express similar meanings, since many of them are variations of the same label. Thus, an NLP pipeline is designed to normalize all input properties, where phrase segmentation, lemmatization, and stop-word removal are introduced for better normalization performance. String-based and language-based similarity metrics are exploited for matching the properties by their normalized labels [6,7]. Then, we align properties from the input KGs following the results of the property matcher. In the next phase, we generate the proposed property-based similarity metrics Sim_H, Sim_V, and Sim_I from the FCA contexts and the aligned properties. According to Function (6), three similarity values are generated for each etype pair, which are then passed to the etype matcher together with general string-based and language-based similarity metrics. We align etypes according to the output of the etype matcher. In the final step, the knowledge extension module integrates entities from the candidate KG according to the new reference KG schema that is updated with the aligned etypes. In Figure 3, we use a circular arrow to illustrate this process, since our method first performs etype-etype alignment to integrate the aligned etypes directly into the reference KG, and then performs entity-etype alignment to integrate the rest of the entities. The final output of the framework is an extended reference KG with integrated etypes and entities.
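The property-label normalization step of the pipeline described above can be sketched with the standard library alone. This is a simplified illustration: it covers phrase segmentation and stop-word removal but omits lemmatization, and the stop-word list, the function names, and the token-level Jaccard score are hypothetical stand-ins for the string-based and language-based metrics used by the property matcher.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"of", "the", "a", "an", "is", "has", "in", "on", "for", "to"}

def normalize_label(label: str) -> List[str]:
    """Split camelCase, underscores and hyphens; lowercase; drop stop words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    tokens = re.split(r"[\s_\-]+", spaced.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

def token_jaccard(label_a: str, label_b: str) -> float:
    """Jaccard overlap of normalized token sets, a simple string-based score."""
    ta, tb = set(normalize_label(label_a)), set(normalize_label(label_b))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(normalize_label("birthPlace"))
print(normalize_label("place_of_birth"))
print(token_jaccard("birthPlace", "place_of_birth"))
```

After normalization, label variants such as birthPlace and place_of_birth reduce to the same token set, which is the effect the pipeline relies on before the property matcher scores candidate property pairs.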

Similarity Calculation Algorithm
The property-based similarity calculation is one of the critical parts of this work. We detail the calculation as SimilarityCalculation(•), shown in Algorithm 1. After formalizing the reference KG_ref and the candidate KG_cand, we assume that the two FCA contexts f_a and f_b are generated correspondingly. Then we obtain the property matching pairs PM from the property matcher. To calculate the similarity of etypes, we generate candidate etype pairs EM for further processing. For each candidate pair in EM, we check their associated properties and add the specificity values to Sim_H, Sim_V and Sim_I when their properties are aligned. After traversing all the candidate pairs, we obtain a complete etype similarity list L, which is used for training the ML model and aligning candidate etypes. Notice that Algorithm 1 presents the calculation of the horizontal similarity Sim_H; the vertical similarity Sim_V and the informational similarity Sim_I are calculated by the same algorithm, where the only modification is to replace HS_KG(•) with VS_KG(•) and IS_KG(•), respectively.
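The traversal described for Algorithm 1 can be sketched as follows. The loop structure (all candidate etype pairs, then all matched property pairs) follows the text; the way a matched pair contributes to the score (the average of the two specificity values, each divided by the number of properties of its etype) is an assumption made for illustration, and all names and toy values are hypothetical.

```python
from itertools import product
from typing import Dict, Set, Tuple

def similarity_list(ref_ctx: Dict[str, Set[str]],
                    cand_ctx: Dict[str, Set[str]],
                    property_matches: Set[Tuple[str, str]],
                    spec_ref: Dict[str, float],
                    spec_cand: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Accumulate specificity over aligned properties for every etype pair."""
    sims: Dict[Tuple[str, str], float] = {}
    for e_ref, e_cand in product(ref_ctx, cand_ctx):
        score = 0.0
        for p_ref, p_cand in property_matches:
            if p_ref in ref_ctx[e_ref] and p_cand in cand_ctx[e_cand]:
                # Assumed contribution: mean of the two per-etype normalized values.
                score += 0.5 * (spec_ref[p_ref] / len(ref_ctx[e_ref])
                                + spec_cand[p_cand] / len(cand_ctx[e_cand]))
        sims[(e_ref, e_cand)] = score
    return sims

# Toy contexts; the specificity values stand in for HS (or VS, IS) scores.
ref_ctx = {"Person": {"name", "birthDate"}, "Place": {"name", "settlement"}}
cand_ctx = {"Human": {"label", "born"}}
matches = {("name", "label"), ("birthDate", "born")}
spec_ref = {"name": 0.5, "birthDate": 1.0, "settlement": 1.0}
spec_cand = {"label": 1.0, "born": 1.0}

sims = similarity_list(ref_ctx, cand_ctx, matches, spec_ref, spec_cand)
print(sims)
```

Swapping the specificity dictionaries for vertical or informational values yields the other two similarity lists, mirroring the substitution of HS_KG(•) by VS_KG(•) or IS_KG(•).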

Knowledge Graph Extension Algorithm
With the help of the etype alignment results, we extend the reference KG by integrating the entities from the candidate KG_cand; details are shown in Algorithm 2. For each etype E_b from KG_cand that is aligned with an etype E_a from KG_ref, we directly add its properties and entities to E_a by addProperty(•) and addEntities(•), respectively. We also consider all subclasses of the aligned etype E_b, since a subclass inherits the properties of the etype and brings new entities. If a subclass E_sub of E_b is not aligned with any other etype in KG_ref, E_sub and its associated properties and entities are merged into E_a. Thus, we integrate candidate entities into KG_ref when their etypes can be aligned with KG_ref. Then, the proposed property-based similarity metrics are applied to align the rest of the candidate entities with an etype E_ref from the reference KG, via the function EtypeRecognizer(•). If we successfully match an etype E_n with the entity Ent_i, Ent_i is merged into E_n; if not, Ent_i is discarded. Note that, depending on the real-world application scenario and topic, the KG engineer decides whether the non-aligned etypes and their corresponding entities need to be integrated into KG_ref. Finally, all changes are applied to KG_ref, and we obtain an extended KG_ext as the final result of our method.
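The extension procedure described for Algorithm 2 can be sketched as below. This is a simplified illustration, not the authors' algorithm: the KG is reduced to a dictionary from etype to properties and entities, only direct subclasses are considered, and the `recognize` callback is a hypothetical stand-in for EtypeRecognizer(•) that returns a reference etype or None.

```python
import copy
from typing import Callable, Dict, List, Optional

KG = Dict[str, Dict[str, set]]  # etype -> {"props": set, "entities": set}

def extend_kg(ref: KG, cand: KG, aligned: Dict[str, str],
              cand_subclasses: Dict[str, List[str]],
              recognize: Callable[[str], Optional[str]]) -> KG:
    """Return an extended copy of `ref`; `aligned` maps candidate to reference etypes."""
    ext = copy.deepcopy(ref)
    covered = set()
    for e_b, e_a in aligned.items():
        # Merge E_b and its subclasses that have no alignment of their own.
        sources = [e_b] + [s for s in cand_subclasses.get(e_b, []) if s not in aligned]
        for src in sources:
            ext[e_a]["props"] |= cand[src]["props"]
            ext[e_a]["entities"] |= cand[src]["entities"]
            covered.add(src)
    # Remaining entities go through etype recognition; unmatched ones are discarded.
    for etype, node in cand.items():
        if etype in covered:
            continue
        for ent in node["entities"]:
            target = recognize(ent)
            if target is not None:
                ext[target]["entities"].add(ent)
    return ext

ref = {"Person": {"props": {"name"}, "entities": {"Ada"}},
       "Place": {"props": {"name"}, "entities": set()}}
cand = {"Human": {"props": {"born"}, "entities": {"Bob"}},
        "Singer": {"props": {"album"}, "entities": {"Cy"}},
        "Misc": {"props": set(), "entities": {"Paris", "X1"}}}
ext = extend_kg(ref, cand, {"Human": "Person"}, {"Human": ["Singer"]},
                lambda ent: "Place" if ent == "Paris" else None)
print(ext)
```

Working on a copy keeps the original reference KG intact, which makes it easy for a KG engineer to compare the reference and extended versions before accepting the changes.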

Machine Learning-based Matchers
According to Algorithm 2 and Figure 3, there are three ML-based matchers in our proposed framework. Here we present more details on training and configuring these matchers.

Etype Matcher
We develop an ML-based method that treats etype matching as a binary classification task. The main idea is to predict whether two incoming etypes are aligned with each other. To apply this method, a list of candidate pairs is generated by pairing etypes from KG_cand and KG_ref. We record a candidate etype pair in EM_ali when the result of the classification is "aligned". The proposed property-based similarity metrics Sim_H, Sim_V and Sim_I are used to train the ML models for matching etypes. For better performance, we also exploit label-based and language-based similarity metrics, along with the property-based similarity metrics, for training the etype matcher.

Etype Recognizer
The strategy for developing the etype recognizer is very similar to that of the etype matcher: we predict whether candidate entities are aligned with target etypes. Thus, we also generate candidate pairs, which consist of entities from KG_cand and etypes from KG_ref. Etypes from KG_ref are output as the final recognition results. Compared to the etype matcher, the main difference is that the etype recognizer mainly uses property-based similarity metrics as features for model training, since lexical labels are not applicable to matching entities and etypes. Thus, the property-based similarity metrics Sim_H, Sim_V, and Sim_I are applied for the etype recognizer.

Property Matcher
The property matcher aims to align properties between KGs, where label-based and language-based similarity metrics are used for model training. The matching strategy is the same as for the etype matcher. It is critical to obtain a powerful property matcher, since both the etype matcher and the etype recognizer are based on the result of the property matcher. In this section, we discuss the following solutions to reduce the effect of misaligned properties.
- Use of the formalization parameter w_E(p). As introduced in Section 2, besides "associated" (positive) and "unassociated" (negative) properties, we also define "undefined" properties (neutral). Since misaligned properties are not used for similarity calculation, they are treated as "undefined" properties, which do not affect the model training and thus reduce additional interference. In contrast, additional interference from misaligned properties appears if "unassociated" and "undefined" properties are not distinguished.
- Use of similarity metrics. Similar to lexical-based similarity metrics, our property-based similarity metrics also allow matching etypes by soft alignment, even if a few properties are not aligned. This increases the robustness of our etype recognition approach.
- Use of ML-based models. By learning from practical data from different resources, ML models provide a learnable strategy rather than a fixed threshold for determining alignments, which maximizes the use of existing aligned properties and minimizes the effect of misaligned properties.

Evaluation
In this section, we evaluate our proposed method on two crucial steps of KG integration, namely (1) etype alignment and (2) etype recognition of entities. We organize this section as follows. Section 6.1 introduces the experimental setup, including the datasets we used, feature selection, and evaluation strategy. Sections 6.2 and 6.3 present the analysis and quantitative evaluation results, respectively. In Section 6.4, we also present an ablation study to explain the setting of parameters.

Experimental Setup
Dataset Selection. For evaluating the result of etype alignment, we exploit the Ontology Alignment Evaluation Initiative (OAEI) as the main reference for the selection of the etype recognition problems. As of today, this is in fact the major source of KG alignment problems. Our proposed method for KG extension involves extending a reference KG through one or more candidate KGs. This implies that the reference KG typically possesses a more comprehensive and high-quality schema, serving as a foundation for KG extension. Our approach focuses on KGs that incorporate etypes associated with a substantial number of properties and complete schemas. As a result, we have selected the following cases: the bibliographic ontology dataset (BiblioTrack) [20] and the conference track (ConfTrack) [59] (ra1 version). From the bibliographic ontology dataset, we select #101-103 and the series #301-304, which present real-life ontologies for bibliographic references from the web. We select the alignment between #101 and #304 as the training set for our ML-based etype matcher, and the rest of the ontology alignments as the testing set. The conference track contains 16 ontologies, dealing with conference organization, and 21 reference alignments. We use all 21 reference alignments from the conference track as the testing set to validate our etype matcher. Notice that we select the training and testing sets from different datasets since we aim to demonstrate the adaptability of our approach, which also helps guard against overfitting. For validating the algorithm of etype recognition, we build a dataset called EnType, since there is no publicly released dataset for such an etype recognition task between two KGs. We exploit the DBpedia infobox dataset as the reference KG for providing reference etypes, because DBpedia is a general-purpose KG that contains common real-world etypes, with sufficient properties describing these etypes. Then we select candidate entities from DBpedia, SUMO and several domain-specific datasets [53]. We select entities mainly according to common etypes, more specifically, Person, Place, Event, Organization and their sub-classes. Finally, we obtain 20,000 etype-entity candidate pairs, of which 6,000 come from DBpedia (EnType_Self) and 14,000 from the remaining resources (EnType_Gen). Note that this dataset is randomly split into training and testing sets for the ML model.
Feature Selection. Our approach applies a general binary classification strategy, which is independent of the specific ML model. The label of each sample, positive or negative, indicates whether the etype-etype or etype-entity pair is aligned. Besides our property-based similarity metrics, some of the most common string-based and language-based similarity metrics are selected as additional features to improve etype recognition performance. The data thus consist of three kinds of features: property-based similarity metrics ($Sim_H$, $Sim_V$ and $Sim_I$), string-based similarity metrics (N-gram [21], longest common sub-sequence [21], Levenshtein distance [58]) and language-based similarity metrics (Wu and Palmer similarity [41] and Word2Vec [17]). These similarity metrics measure different aspects of the relevance between the reference etype and the candidates. Since all the above-mentioned similarity metrics are symmetric, the order of the etype/entity in a candidate pair does not affect the final results. Moreover, we apply only the property-based similarity metrics ($Sim_H$, $Sim_V$ and $Sim_I$) for etype recognition, since the label of a candidate entity is commonly not relevant to its etype.
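As an illustration of the string-based part of the feature vector, the following minimal sketch computes three label similarities for a candidate pair. The function names, the use of character trigrams with a Dice coefficient for the N-gram score, and the normalisation by the longer label are assumptions made for this example; the text above does not fix these details.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s (the whole string if it is shorter than n)."""
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets (assumed variant)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_features(label_a, label_b):
    """String-based features for one candidate pair, each scaled to [0, 1]."""
    a, b = label_a.lower(), label_b.lower()
    longest = max(len(a), len(b)) or 1  # avoid division by zero for empty labels
    return {
        "ngram": ngram_similarity(a, b),
        "lcs": lcs_length(a, b) / longest,
        "levenshtein": 1 - levenshtein_distance(a, b) / longest,
    }


# The features are symmetric, so the order of the pair does not matter.
print(string_features("Paper", "Contribution"))
print(string_features("Contribution", "Paper"))
```

In a full pipeline these values would be concatenated with the property-based and language-based scores before being passed to the classifier.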
Training Strategy. We have discussed how candidate pairs are formed for binary classification in the different etype recognition tasks. However, positive and negative samples among the candidate pairs are usually unbalanced in practice; more specifically, far more negative samples are produced than positive ones. For instance, aligning two KGs that contain 1,000 etypes each generates one million candidate pairs, of which only hundreds are positive samples. As a result, an ML model trained on such an unbalanced dataset is easily overfitted. To address this issue, we propose a model training strategy that increases the weight of positive samples to achieve a more balanced training set and alleviate overfitting. By duplicating a part of the positive samples, we achieve the best etype recognition performance when keeping the ratio of positive to negative samples at 1:10. Notice that this data augmentation strategy is not applied to the testing set; the candidate pairs in the testing set are selected randomly to avoid interference.
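The duplication step can be sketched as follows. This is only an illustration under assumptions: the function name `oversample_positives`, the `(features, label)` sample layout and the random choice of which positives to duplicate are hypothetical, since the text only states that a part of the positive samples is duplicated until the ratio reaches 1:10.

```python
import math
import random


def oversample_positives(samples, neg_per_pos=10, seed=0):
    """Duplicate positive samples until there is one positive per `neg_per_pos` negatives.

    `samples` is a list of (features, label) tuples with label 1 for an
    aligned pair and 0 otherwise. Sets that already meet the ratio are
    returned with their content unchanged (only shuffled).
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    if not positives:
        return list(samples)

    target = math.ceil(len(negatives) / neg_per_pos)  # positives needed for 1:10
    rng = random.Random(seed)
    missing = max(0, target - len(positives))
    duplicates = [rng.choice(positives) for _ in range(missing)]

    balanced = positives + duplicates + negatives
    rng.shuffle(balanced)
    return balanced


# Toy training split: 5 aligned pairs among 1,000 non-aligned ones.
train = [((i,), 1) for i in range(5)] + [((i,), 0) for i in range(1000)]
balanced_train = oversample_positives(train)
print(sum(label for _, label in balanced_train), "positives after duplication")
```

Consistent with the strategy described above, such a function would be applied to the training split only, leaving the testing set untouched.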
Dealing with Trivial Samples. In order to reduce the number of negative samples and avoid generating unnecessary candidate pairs, we prune trivial samples that are obviously negative. For the different tasks, we have:
- For the etype alignment task, we apply label-based measurements to filter out the obviously negative samples. Given two etypes $E_a$ and $E_b$ from a candidate pair, we define the pre-selection factor $PS_s$ as:
where $Ngram(\cdot)$ and $Word2Vec(\cdot)$ are two similarity measurements; $Ngram(\cdot)$ assesses lexical similarity between strings by analyzing contiguous sequences of n items from a given string [21]; $Word2Vec(\cdot)$ captures semantic similarity between strings through deep feature representations using the Word2Vec model [17]. Thus, we consider $(E_a, E_b)$ to be an obviously negative candidate etype pair if $PS_s$ is greater than the threshold $th$. Experimentally, we find that $th = 0.3$ leads to better results, where around 70% of trivial sample pairs are removed while interfering minimally with the majority of positive samples.
- For the etype recognition task, an obviously negative candidate pair is identified when two entities $I_a$ and $I_b$ have no shared property. Such candidate pairs are pruned before being input into the etype recognizer.
The pre-selection of trivial samples effectively reduces the run-time of model training. Moreover, it helps to decrease the risk of overfitting by removing redundant samples, and it improves the performance of KG alignment.
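The pruning rule for the etype recognition task (discard a candidate pair whose members share no property) can be sketched as below. The function name `prune_trivial_pairs`, the dictionary layout and the toy property names are hypothetical, and properties are assumed to have been aligned beforehand so that equal names denote the same property.

```python
def prune_trivial_pairs(candidate_pairs, properties):
    """Keep only the candidate pairs whose two members share at least one property.

    `candidate_pairs` is an iterable of (left, right) identifiers and
    `properties` maps each identifier to the set of its (aligned) property names.
    """
    kept = []
    for left, right in candidate_pairs:
        shared = properties.get(left, set()) & properties.get(right, set())
        if shared:  # an empty intersection marks an obviously negative pair
            kept.append((left, right))
    return kept


# Toy data: two reference etypes and two candidate entities.
properties = {
    "Person": {"name", "birthDate"},
    "Place": {"name", "latitude", "longitude"},
    "entity_1": {"name", "birthDate"},
    "entity_2": {"elevation"},
}
pairs = [(etype, entity)
         for etype in ("Person", "Place")
         for entity in ("entity_1", "entity_2")]
print(prune_trivial_pairs(pairs, properties))
```

Only the pairs that survive this filter would be turned into feature vectors, which is where the reduction in training time comes from.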
Evaluation metrics. In our experiments, we use standard evaluation metrics, including precision (Prec.), recall (Rec.) and the $F_1$-measure ($F_1$-m.), together with the $F_{0.5}$-measure ($F_{0.5}$-m.) and the $F_2$-measure ($F_2$-m.) [44], to comprehensively validate our method and compare it with state-of-the-art methods. We form the etype recognition candidates as pairs, where each pair consists of a reference etype and a candidate etype/entity. The $F_\beta$-measures ($F_\beta$-m.) are defined as the weighted harmonic mean of recall and precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{Prec \cdot Rec}{\beta^2 \cdot Prec + Rec}$$

where $\beta$ is the configuration factor that allows precision or recall to be weighted more highly, depending on which is more important for the use case. The metrics $F_1$-m., $F_2$-m. and $F_{0.5}$-m. are obtained with $\beta = 1$, $2$ and $0.5$, respectively. We consider the $F_\beta$-m. to be the most relevant metrics for evaluation since they reflect both recall and precision.
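For concreteness, the standard $F_\beta$ definition can be computed with the short sketch below; the function name `f_beta` is chosen for this example only. As a sanity check, it is applied to the first precision/recall pair reported for ETA XGBoost in the results (0.827 and 0.676).

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


prec, rec = 0.827, 0.676
for beta in (0.5, 1, 2):
    print(f"F_{beta}: {f_beta(prec, rec, beta):.3f}")
# prints 0.792, 0.744 and 0.702, matching the reported F-measures for this row
```

Because precision exceeds recall in this row, the score falls as $\beta$ grows and recall receives more weight.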

Etype Alignment
Qualitative analysis. Table 2 provides representative examples of the etype similarity metrics for candidate etype-etype pairs from cmt-confof (first four rows) and cmt-conference (last four rows) in ConfTrack. The "Match" column indicates whether two etypes refer to the same concept, where × refers to a positive answer. We find that the values of our property-based similarity metrics indeed capture the contextual similarity between relevant etypes: aligned etypes yield higher values (e.g., paper-contribution), whereas non-aligned etypes yield lower values (e.g., person-document). From a broad observation of the metric values, we consider the property-based similarity metrics $Sim_H$, $Sim_V$ and $Sim_I$ to be valid for etype-etype pairs.
Quantitative Evaluation. We apply two ML models to evaluate the validity of our proposed similarity metrics on the etype alignment task, namely XGBoost [16] and an artificial neural network (ANN) classifier [37], denoted $ETA_{XGBoost}$ and $ETA_{ANN}$. We compare our work with state-of-the-art etype alignment methods, including FCAMap [15], AML [24], CNN-based ontology matching [8], word-embedding-based ontology alignment [39], LogMap and LogMapLt [35]. We calculate the evaluation metrics mentioned above for a comprehensive comparison. Notice that we focus on the result of etype-etype alignment in this experiment. Table 3 shows the results of our approach with the two models, compared with the results of state-of-the-art methods. Firstly, our approach produces different results with different models. $ETA_{XGBoost}$ performs better than $ETA_{ANN}$ and also outperforms the other methods on BiblioTrack and in most cases on ConfTrack. The method by Nkisi-Orji et al. [39] leads in precision on ConfTrack, while it performs poorly on recall for both datasets. AML also shows competitive overall results compared with the other state-of-the-art methods. Considering that the average results of our approach with different models are better than or close to the state of the art, we can say that our approach surpasses the state-of-the-art competitors on the etype alignment task.

Table 3:
                          ConfTrack                                        BiblioTrack
Method                    Prec.   Rec.    F0.5-m.  F1-m.   F2-m.           Prec.   Rec.    F0.5-m.  F1-m.   F2-m.
FCAMap [15]               0.680   0.625   0.668    0.651   0.635           0.820   0.783   0.812    0.801   0.790
AML [24]                  0.832   0.630   0.782    0.717   0.662           0.869   0.822   0.859    0.845   0.830
LogMap [35]               0.798   0.592   0.746    0.680   0.624           0.832   0.694   0.800    0.757   0.718
Alexandre et al. [8]      0.795   0.638   0.758    0.708   0.664           0.827   0.786   0.818    0.806   0.794
Nkisi-Orji et al. [39]    0.860   0.514   0.758    0.643   0.559           0.858   0.682   0.816    0.760   0.711
LogMapLt [35]             0.716   0.554   0.676    0.625   0.580           0.796   0.781   0.793    0.788   0.784
ETA_XGBoost               0.827   0.676   0.792    0.744   0.702           0.870   0.832   0.862    0.851   0.840
ETA_ANN                   0.813   0.604   0.760    0.693   0.636           0.797   0.811   0.799    0.803   0.808

Meanwhile, the promising overall performance produced the same models as in the etype alignment task, i.e., $ETR_{XGBoost}$ and $ETR_{ANN}$. Two state-of-the-art methods [30,49] are also included in the experiment. Firstly, we find that our ETR methods using the proposed property-based similarity metrics significantly surpass the two state-of-the-art methods. For the self recognition group, $ETR_{XGBoost}$ performs better in two cases while $ETR_{ANN}$ also achieves competitive results. For the general recognition cases, the XGBoost-based ETR method keeps its stable performance and surpasses all compared methods. We achieve promising results in both cases, which supports the validity of our similarity metrics and the etype recognition approach. Meanwhile, we find that the precision in the general recognition case is lower than in self recognition, which is consistent with the relative difficulty of the two cases.
Considering that etype recognition performance is affected by entity resolutions, we conduct an additional experiment on aligning entities with more specific etypes. We select four sub-classes of the etypes Person and Organization, with their corresponding entities, as candidate pairs. We again compare $ETR_{XGBoost}$, $ETR_{ANN}$ and the same state-of-the-art methods in this experiment. Table 6 presents the $F_1$-measure of the recognition results. Our methods still achieve better recognition performance than the other methods in each case. Both models obtain promising overall performance on specific etype recognition, where $ETR_{ANN}$ performs better on Comedian and Company, and $ETR_{XGBoost}$ leads in the rest of the cases. The experimental results show that our metrics and approach can also be applied to specific instance-level etype recognition, which further supports the performance of the KG extension algorithm.

Ablation Study
We present ablation studies in this section to validate the effectiveness of specific components of our KG extension framework.

Effect of similarity metrics
The first ablation study evaluates whether each of the proposed property-based similarity metrics is effective. In this experiment, we test the backbones (B) that were used in the etype alignment and recognition tasks, respectively. Based on the backbones, we also design a controlled group that includes models trained without one of the property-based similarity metrics (i.e., B-$Sim_V$, B-$Sim_H$ and B-$Sim_I$) and models trained without all of these metrics (i.e., B-L). If the backbones perform better than the corresponding models in the controlled group, we can quantitatively conclude that each of the property-based similarity metrics ($Sim_V$, $Sim_H$, $Sim_I$) contributes to the etype alignment and recognition tasks. Table 7 shows the $F_1$-measure of each group. We use ConfTrack for etype alignment and $EnType_{Gen}$ for etype recognition. Note that we select two models for both cases, as Table 7 shows. We find that the backbones perform better than the models in the controlled group, especially compared with the models trained without all metrics. Thus, we consider that all property-based similarity metrics contribute to better recognition performance. In particular, $Sim_V$ and $Sim_H$ significantly affect the performance in the etype alignment cases, while $Sim_I$ affects the etype recognition cases more.

Effect of constraint factor
In Section 4.1, we defined a constraint factor $\lambda$ for calculating the metric $Sim_H$. This study aims to empirically identify the value of $\lambda$. We use the ConfTrack dataset and its two best-performing models. The value of $\lambda$ is varied evenly from 0.1 to 1 at discrete points. We evaluate whether this pre-set factor affects the final recognition performance and obtain the best value of $\lambda$ for generic etype recognition. Table 8 shows the results, where we highlight both the best and second-best results. Different values of $\lambda$ do affect the final etype recognition performance, and the two models show a similar trend in which the best value of $\lambda$ is close to 0.5. As a result, we assign $\lambda = 0.5$ to calculate the metric $Sim_H$ in our experiments.

Case Study
This section qualitatively analyzes the KG extension performance of our proposed method through use cases. In KG extension, we assume that a reference KG is extended based on one or more candidate KGs. We aim to simulate a real-world scenario in which people extend a general-purpose KG with domain-specific KGs to broaden its usability. We select the widely applied schema.org as the reference KG. As candidate KGs, we introduce two domain-specific KGs, Transportation and educationtrentino, which were created to represent local transportation and education, respectively. We selected these KGs because they provide very different examples in terms of the number of properties and etypes. Moreover, almost all of their etype labels are human-understandable, which helps qualitative analysis.
We introduce four ranking metrics in the case study, namely Class Match Measure (CMM) [1], Density Measure (DEM) [1], Focus [26] and TF-IDF [46]. CMM evaluates the coverage of a KG for the given search etypes, by looking for etypes in each KG whose labels match a search term either exactly or partially. DEM is intended to approximate the representational density or information content of etypes, and consequently the level of knowledge detail, by considering, for each etype, the number of subclasses, the number of properties associated with that etype, the number of siblings, etc. Focus evaluates a KG by identifying informative etypes with higher categorization relevance, using the properties associated with the target etypes. TF-IDF is also widely used for KG ranking; it calculates the relevance between a potential KG and a specific term that describes the domain of interest. These metrics are used to evaluate the quality of the extended reference KG: we compare the metrics on the original reference KG with those on the extended reference KG to see whether the newly added items from candidate KGs affect the quality of the reference KG. More specifically, we record the metrics for valid etypes in the KG and use these scores as a baseline, selecting the scores of the top 15 etypes. The relevance of our approach is then measured in terms of accuracy (from 0 to 1) by checking how many etypes in the ranking results are in the etype ranking lists provided by the knowledge engineers. The output of this experiment is presented in Fig. 4.
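The accuracy check described above can be sketched as follows. The function name `ranking_accuracy`, the toy etype names and the choice to normalise by the number of ranked etypes actually considered are assumptions made for illustration; the text only states that the top 15 etypes are compared against the lists provided by the knowledge engineers.

```python
def ranking_accuracy(ranked_etypes, expert_etypes, top_k=15):
    """Fraction of the top-k ranked etypes that also appear in the expert list.

    `ranked_etypes` is ordered from best to worst score under one ranking
    metric (e.g. CMM, DEM, Focus or TF-IDF); `expert_etypes` is the list
    supplied by the knowledge engineers. The result lies in [0, 1].
    """
    top = ranked_etypes[:top_k]
    if not top:
        return 0.0
    expert = set(expert_etypes)
    hits = sum(1 for etype in top if etype in expert)
    return hits / len(top)


# Toy example with a shortened list (top_k=4 instead of 15).
ranked = ["Place", "Trip", "BusStop", "Person", "Thing"]
expert = ["Trip", "BusStop", "Place", "Route"]
print(ranking_accuracy(ranked, expert, top_k=4))
```

Computing this value once for the original reference KG and once for the extended one gives the two bars that are compared for each metric.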
The main observation is that our extended KG shows promising performance on all metrics. More specifically, with the terms chosen by knowledge engineers, the extended KG shows strong categorization relevance according to the Focus metric. The CMM and DEM scores are also promising, which demonstrates the quality of the knowledge coverage and the representational density, respectively. Our KG gains a fair score on the TF-IDF metric since the reference schema.org is a general-purpose KG that contains rather diverse concepts, which affects the relevance to a domain-specific corpus to some extent. The second observation is that the extended KG performs better than the original KG on most of the metrics. The increase in the CMM score indicates that the newly added items enlarge the coverage of the etypes in the KG. The improved DEM score shows that the details of etypes from the reference KG have been enhanced by the added properties and subclasses. The increase in the Focus score also indicates that categories can be better identified from a taxonomy perspective. Notice that the extended schema.org has even more diverse concepts after KG extension, which is a possible reason for the slight decrease in the TF-IDF score. Overall, the experimental results in the case study demonstrate the ability of our method for KG alignment and extension.

Related Work

Ontology Matching and Schema Alignment
Ontology matching and schema alignment have been attractive research topics in recent decades. In the early phases of the research, researchers mainly focused on string-based methods. String analysis techniques were defined, including 1) string-based metrics (N-gram, Levenshtein, etc.), 2) syntactic operations (lemmatization, stop word removal, etc.) and 3) semantic analysis (synonyms, antonyms, etc.) [14]. Sun et al. [52] review a wide range of string similarity metrics and propose an ontology alignment method that selects similarity metrics at different scales. Although string-based methods can be effective in many cases, selecting the right metric for matching specific datasets is the most challenging part. To address this issue, some studies [12,38] introduce an ensemble matching strategy, which applies multiple matchers based on different string-based metrics. The principle of these works is that combined matchers are more powerful than individual ones. The structure of a KG has also been considered important information for identifying etypes [5,28]. Such studies suppose that two etypes are more likely to be aligned if they have the same super-class or sub-class. The LogMap system [35] uses a two-step matching strategy: it first matches two etypes $E_a$ and $E_b$ with a lexical matcher, and then considers that the etypes semantically close to $E_a$ are more likely to be semantically close to $E_b$. AML [24] introduces an ontology matching system that consists of a string-based matcher and a structure-based matcher, building internal correspondences by exploiting is-a and part-of relationships.
Machine learning techniques have been widely applied to this topic. Some studies model the etype matching task as a binary classification task, encoding information such as string and structure similarities as features for model training. Amrouch et al. [3] develop a decision tree model that exploits lexical and semantic similarities of the etype labels to match schemas. By encoding the lexical similarity of the superclass and subclass as structure similarity, Bulygin and Stupnikov [13] improve the former method and achieve promising results. At the same time, formal concept analysis (FCA) lattices are applied in schema matching methods [15,51]. To refine the output of health record searches, Cure et al. [18] exploit FCA and Semantic Query Expansion to assist end-users in defining their queries and in refining the expanded search space. Stumme et al. [51] propose a bottom-up ontology merging approach that uses FCA lattices to keep the ontology hierarchy.

Entity Type Recognition
According to their usage and motivation, studies on entity type recognition (also called entity typing) follow three main directions: (1) recognizing the type of an entity from text [40,56]; (2) recognizing the type of entities within a single KG for KG completion [57,60]; (3) recognizing the type of entities from different KGs for KG extension [49], which is the focus of this paper. Different from the former two tasks, we focus on recognizing the type of etypes/entities from other KGs in order to extend the reference KG automatically. In this field, some dedicated methods have been proposed for specific datasets [19,47]. Rather than using label-based methods, some previous studies also consider properties as a possible solution: the works in [48,49] propose an etype recognition method that models etype recognition as a multi-class classification task. However, a pre-filtering step is needed, since only the properties shared across all candidates are counted for training and testing; as a result, only a few properties remain after such filtering and a large amount of critical information is discarded. Thus, the adaptability of such methods is limited when they are applied in practice. Giunchiglia and Fumagalli [30] propose a set of metrics for selecting the reference KG to improve the above method, which achieves improved performance with the support of a large number of KGs. However, limitations remain, since these studies give all properties the same weight and neglect that different properties contribute differently to etype recognition.

Knowledge Graph Extension
KG extension aims to integrate additional knowledge from other KGs. It is different from KG self-completion, which adds missing knowledge (concepts and properties) into the reference KG without exploiting other resources. In the context of the semantic web, most current approaches locate the corresponding entity pairs in KGs and then directly integrate the KGs by taking their union [34]. Several approaches for integrating schemas, given in terms of theories of classical first-order logic and rule bases, have been proposed. They either cast rules into classical logic or limit the interaction between rules and schemas [22,50]. Bruijn et al. [11] presented three embeddings for ordinary and disjunctive nonground logic programs under the stable model semantics to select KGs for integration. Wiharja et al. [55] improved the correctness of KG combinations based on Schema Aware Triple Classification (SATC), which enables sequential combinations of KG embedding approaches. However, these approaches are still limited when applied in practice. Novel methods for automatic schema integration are needed to extend general-domain KGs in an efficient and accurate way.

Conclusions
In this paper, we have proposed an ML-based framework for KG alignment and extension via a set of novel property-based similarity metrics. Firstly, we introduce a KG formalization method, which encodes etypes/entities and their corresponding properties into FCA contexts. We argue that the corresponding properties are used to intentionally describe etypes, which provides a novel insight for identifying etypes. Then we propose three metrics for measuring the contextual similarity between reference etypes and candidate etypes/entities, namely the horizontal similarity $Sim_H$, the vertical similarity $Sim_V$, and the informational similarity $Sim_I$. Based on the proposed metrics, we introduce the framework with detailed algorithms and modules for KG alignment and extension, and we validate the framework on the corresponding tasks. Compared with the state-of-the-art methods, the experimental results show the validity of the similarity metrics and the superiority of the proposed KG alignment methods, both quantitatively and qualitatively. Our future work will focus on exploring further uses of the proposed similarity metrics in tasks such as KG refinement and completion, including complex KG matching cases and resolving conflicts during the KG extension process. This exploration is expected to enhance the depth and accuracy of knowledge extraction and to increase the practical applicability and adaptability of our proposed methods.

Fig. 1: An example of the hierarchical schema in KG.
$E_a$, $E_b$ as the candidate etypes from $A$ and $B$ respectively, thus $E_a \in A$ and $E_b \in B$; $prop(E)$ refers to the properties associated with the specific etype and $|prop(E)|$ is the number of properties in $prop(E)$; $SPC_{ETG}(\cdot)$ represents the specificity measurements we defined above, $SPC(\cdot) = \{HS(\cdot), VS(\cdot), IS(\cdot)\}$, thus $SPC_A(E_a, p_i)$ and $SPC_B(E_b, p_i)$ refer to the specificity of the aligned property $p_i$ in $A$ and $B$ respectively; $k$ is the number of aligned properties which are associated with both etypes $E_a$ and $E_b$. As a result, we obtain three similarity metrics, which are the horizontal similarity $Sim_H$, the vertical similarity $Sim_V$ and the informational similarity $Sim_I$.

Fig. 3: The framework of KG alignment and extension.

Algorithm 1 Calculating horizontal similarity Sim_H between reference and candidate KGs.
L_H = SimilarityCalculation(f_a, f_b)
Input: Reference and candidate FCA contexts f_a, f_b;
Output: List of all horizontal similarities L_H;
1: PM = (p_a × p_b) = PropertyMatcher(f_a, f_b); {PM is formed as a set of aligned property pairs, where p_a ∈ f_a, p_b ∈ f_b.}
2: EM = (E_a × E_b) = EtypeSelector(f_a, f_b); {etypes E_a, E_b from f_a, f_b are assembled as candidate calculation pairs EM.}
3: for all (E_j, E_k) ∈ EM do
4:   sim_H(E_j, E_k) = 0; {initialize the value of horizontal similarity sim_H(E_j, E_k).}
5:
     {calculate the corresponding specificity HS for the similarity sim_H(E_j, E_k).}
     * sim_H(E_j, E_k)); {stack the value of horizontal similarity to the list L_H.}
11: end for
12: return L_H

Table 1: Shared properties of etype Person across different KGs.
definitions of Person, as in OpenCyc and DBpedia, share twenty-three properties. Meanwhile, we also present concrete shared properties of etype Person across different KGs in Table 1, e.g., birth and education are applied in both OpenCyc and DBpedia.

Algorithm 2 Extending reference KG by integrating a candidate KG.
KG_ext = KGExtension(KG_ref, KG_cand, EM_ali)
Input: Reference and candidate knowledge graphs KG_ref, KG_cand; Aligned etype pairs EM_ali;
Output: The extended reference knowledge graph KG_ext;
1: for all (E_a, E_b) ∈ EM_ali do
2:   E_a.addProperty(E_b.property); {Merging the properties of etype E_b into E_a, where E_a ∈ KG_ref, E_b ∈ KG_cand.}
3:   E_a.addEntities(E_b.Entities); {Merging the entities of E_b into E_a.}
4:   for all E_sub ∈ E_b.subClass do
5:     E_a.addSubClass(E_sub); {Merging the etypes E_sub and their entities into E_a.}
   E_ref = listEtypes(KG_ref);
10: Ent_cand = listEntities(KG_cand);
11: for all Ent_i ∈ Ent_cand do
12:   if Ent_i ∉ EM_ali then
13:     E_n = EtypeRecognizer(Ent_i, E_ref); {Recognizing the etype E_n of candidate entities Ent_i, where E_n, E_ref ∈ KG_ref.}
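The surviving steps of Algorithm 2 can be illustrated with the following minimal Python sketch. It is not the authors' implementation: the `Etype` container, the dictionary-based KG layout, the `extend_kg` name and the `recognize_etype` callback are hypothetical, and the final step (attaching a recognized entity to its etype in the reference KG) is an assumption, because the listing is truncated after the call to EtypeRecognizer.

```python
from dataclasses import dataclass, field


@dataclass
class Etype:
    """Toy etype: a name plus its properties, entities and sub-classes."""
    name: str
    properties: set = field(default_factory=set)
    entities: set = field(default_factory=set)
    subclasses: list = field(default_factory=list)


def extend_kg(kg_ref, kg_cand, aligned_pairs, recognize_etype):
    """Extend kg_ref in place with the content of kg_cand.

    kg_ref and kg_cand map etype names to Etype objects; aligned_pairs is a
    list of (reference name, candidate name) tuples; recognize_etype(entity,
    kg_ref) returns the name of a reference etype or None (hypothetical
    stand-in for the etype recognizer).
    """
    merged_entities = set()

    # Schema-level extension over the aligned etype pairs.
    for ref_name, cand_name in aligned_pairs:
        e_a, e_b = kg_ref[ref_name], kg_cand[cand_name]
        e_a.properties |= e_b.properties      # merge properties of E_b into E_a
        e_a.entities |= e_b.entities          # merge entities of E_b into E_a
        merged_entities |= e_b.entities
        for e_sub in e_b.subclasses:          # attach sub-classes with their entities
            e_a.subclasses.append(e_sub)
            merged_entities |= e_sub.entities

    # Instance-level extension: type the entities not covered by the alignment.
    for etype in kg_cand.values():
        for entity in etype.entities - merged_entities:
            target = recognize_etype(entity, kg_ref)
            if target is not None:            # assumed final step (see lead-in)
                kg_ref[target].entities.add(entity)
    return kg_ref


# Toy usage with a trivial recognizer based on an identifier prefix.
student = Etype("Student", {"school"}, {"carol"})
ref = {"Person": Etype("Person", {"name"}, {"alice"})}
cand = {
    "Human": Etype("Human", {"name", "birthDate"}, {"bob"}, [student]),
    "Student": student,
    "Misc": Etype("Misc", set(), {"p_dave", "x_rock"}),
}
extended = extend_kg(ref, cand, [("Person", "Human")],
                     lambda entity, kg: "Person" if entity.startswith("p_") else None)
print(sorted(extended["Person"].entities))
```

In a real pipeline the callback would wrap the trained etype recognizer, and entities for which no reference etype is recognized would simply be left out of the extended KG, as in this sketch.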

Table 2: Representative samples of property-based similarity $Sim_V$, $Sim_H$ and $Sim_I$ on etype-etype pairs.

Table 7: Ablation study on property-based similarity metrics.

Table 8: Ablation study on the constraint factor $\lambda$. The best and second-best results are highlighted in red and blue, respectively.