Optimizing Ontology Alignment through Improved NSGA-II

Over the past decades, many complex optimization problems have been addressed with multiobjective evolutionary algorithms (MOEAs), and the knee solutions of the Pareto front (PF) are the ones most likely to suit the decision maker (DM) when no user preferences are available. This work investigates the ontology matching problem, a challenge in the semantic web (SW) domain. Because of the complex heterogeneity between two different ontologies, it is difficult to obtain an alignment that meets every DM's demands. To this end, a popular MOEA, the nondominated sorting genetic algorithm (NSGA-II), is used to address the ontology matching problem and to output the knee solutions of the PF so that diverse DMs' requirements can be met. To further enhance the performance of NSGA-II, we incorporate the monkey king evolution algorithm (MKE) into its evolutionary process as a local search algorithm. The improved NSGA-II (iNSGA-II) converges better to the real Pareto optimum region and improves the quality of the solutions. The experiment uses the benchmark provided by the ontology alignment evaluation initiative (OAEI) to assess the performance of iNSGA-II, and the results show that iNSGA-II finds better alignments than the OAEI participants and the NSGA-II-based ontology matching technique.


Introduction
Over the past decades, many complex optimization problems have been addressed with multiobjective evolutionary algorithms (MOEAs) [1][2][3]. Generally, the objectives conflict with one another, so they cannot all be optimized at the same time, which makes these problems both challenging and realistic [2]. The general approach is to obtain a set of solutions that do not dominate each other, the so-called Pareto front (PF), and the knee solution of the PF is the one most likely to suit the decision maker (DM) when no user preferences are available [4,5]. In this setting, we study the ontology alignment problem in the semantic web (SW) domain. An ontology, the kernel technique of the SW, gives a formal definition of domain knowledge. Matching ontologies identifies their heterogeneous entities, which can speed up translation discovery and knowledge integration [6]. However, because of the complex heterogeneity between different ontologies, it is difficult to obtain an alignment that meets every DM's demands, and the matching process usually requires a trade-off between two objectives, i.e., the precision and recall of the obtained alignment.
To meet the diverse requirements of various DMs, a well-known multiobjective EA (MOEA), the nondominated sorting genetic algorithm (NSGA-II) [7][8][9][10], has been applied to the ontology matching problem; it determines the nondominated solutions and outputs the knee solution of the PF as the representative one. Meanwhile, a variety of hybrid optimization approaches that integrate EA with a local search algorithm (LS) have been introduced in recent years to improve the accuracy and speed of convergence to the true optimum solutions [3,11,12,13]. This combination maintains high population diversity while strengthening the search ability, which speeds up convergence and lowers the probability of premature convergence.
In this study, to efficiently determine the knee solution of the PF, we propose an improved NSGA-II (iNSGA-II) that incorporates the monkey king evolution algorithm (MKE) [14] into NSGA-II's evolutionary process as the LS algorithm. The proposed iNSGA-II converges better to the real Pareto optimum solution and improves the quality of the solution. The remainder of the paper is organized as follows: Section 2 reviews the related work; Section 3 defines the ontology matching problem and describes the similarity measures and the aggregation strategy; Section 4 presents the iNSGA-II-based ontology matching technique in detail; Section 5 reports the experimental results; and finally, Section 6 draws the conclusion.

Related Work
A variety of EA-based matchers that effectively tackle the ontology alignment problem have been introduced in recent years. The first category of EA-based matchers addresses the ontology metamatching problem, which aims to optimize the parameters used to aggregate the alignments of diverse ontology matchers. Naya et al. [15] first proposed an EA-based approach in which several similarity measures are combined into one to improve the quality of alignments, and its encoding mechanism is still widely used. After that, Acampora et al. proposed a memetic algorithm that incorporates local perturbation into EA to enhance the performance of the algorithm [11]. Acampora et al. [7] and Xue et al. [10] applied NSGA-II to optimize the alignment and obtained better results than the genetic algorithm. Biniz et al. proposed a hybrid approach that combines NSGA-II and a neural network, and its results were effective. Xue et al. [3] proposed MatchFmeasure, an approximation of the f-measure that guides the algorithm's search without using the reference alignment. The second category of EA-based ontology matching techniques is devoted to determining the optimum set of entity correspondences. Wang et al. [16] proposed GAOM (genetic algorithm-based ontology matching), which first formulates a discrete optimization model for optimizing the mapping set. Chu et al. [17] presented a new metric that takes into consideration the entity information in the vector space and then applied an EA-based matcher to enhance the alignment quality. Xue et al. proposed a memetic algorithm that is effective in instance coreference resolution [12]. Several approaches that efficiently determine alignments using EA with user involvement have also been proposed [13,18,19]. Xue et al. [20] proposed CETS (compact evolutionary Tabu search algorithm) to match sensor ontologies, and Acampora et al. [21] compared the performance of five local search algorithms on the ontology matching problem and found that Tabu search performed best.
The proposed iNSGA-II belongs to the first category. Our proposal combines NSGA-II, which provides the best trade-off between precision and recall, with MKE, whose exploitation and exploration characteristics make it suitable as a local search algorithm, to further improve the quality of the results and meet DMs' demands.

Ontology Matching Problem.
An ontology consists of classes, data type properties, and object properties, which are generally called ontology entities [22]. The purpose of ontology matching is to determine the set of entity correspondences, the so-called ontology alignment. Traditionally, the f-measure is used to assess the quality of an alignment, which is defined as follows:
$$\text{recall} = \frac{|R \cap A|}{|R|}, \qquad \text{precision} = \frac{|R \cap A|}{|A|}, \qquad \text{f-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}},$$
where $R$ is the reference alignment provided by domain experts and $A$ is the alignment produced by the ontology matcher. Recall and precision measure the completeness and correctness of an alignment, respectively, and they are usually balanced through their harmonic mean, i.e., the f-measure. In the real world, however, obtaining reference alignments is expensive, especially when handling large-scale ontologies. In this paper, we use three approximate measures, i.e., MatchCover [3], MeanSimilarity [17], and MatchFmeasure, to roughly approximate recall, precision, and f-measure, respectively. The motivation of MatchCover is to map more identical entities, and that of MeanSimilarity is to ensure a higher similarity of the aligned entities. Assuming that the cardinality of the golden alignment is one to one, the three approximate measures are defined as follows:
$$\text{MatchCover} = \frac{|O_{1\text{-match}}| + |O_{2\text{-match}}|}{|O_1| + |O_2|}, \qquad \text{MeanSimilarity} = \frac{\sum_{i=1}^{|A|} simValue_i}{|A|}, \qquad \text{MatchFmeasure} = \frac{2 \times \text{MatchCover} \times \text{MeanSimilarity}}{\text{MatchCover} + \text{MeanSimilarity}},$$
where $|O_1|$ and $|O_2|$ are the cardinalities of $O_1$ and $O_2$, respectively; $|O_{1\text{-match}}|$ and $|O_{2\text{-match}}|$ are the numbers of matched entities in the two ontologies, respectively; $|A|$ is the number of correspondences; and $simValue_i$ is the similarity value of the $i$th entity correspondence. Finally, we define the multiobjective optimization model for the ontology matching problem as follows:
$$\max\ F(X) = \big(f_1(X), f_2(X)\big) \quad \text{s.t.}\ \ X = (x_1, x_2, \ldots, x_n, x_{n+1})^{T},\ \ \sum_{i=1}^{n} x_i = 1,\ \ x_i \in [0, 1],\ i = 1, 2, \ldots, n+1,$$
where $f_1(X)$ and $f_2(X)$ are the MatchCover and MeanSimilarity of $X$, respectively; $n$ is the number of similarity measures used; $x_i$, $i = 1, 2, \ldots, n$, is the weight of the $i$th similarity measure; and $x_{n+1}$ is the threshold used to filter the final alignment.
Discrete Dynamics in Nature and Society
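The reference-based and reference-free measures of this section can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names and the data layout (sets of entity pairs, a dictionary from pairs to similarity values) are hypothetical, and MatchCover, MeanSimilarity, and MatchFmeasure are computed in the forms suggested by the symbol definitions above (matched entities over all entities, average similarity, and their harmonic mean).

```python
def precision_recall_f(alignment, reference):
    """Score an alignment against a reference; both are sets of (e1, e2) pairs."""
    correct = len(alignment & reference)
    precision = correct / len(alignment) if alignment else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


def approximate_measures(alignment, size_o1, size_o2):
    """Reference-free scores; alignment maps (e1, e2) -> similarity value.

    Assumed forms: MatchCover = matched entities / all entities,
    MeanSimilarity = average similarity, MatchFmeasure = harmonic mean.
    """
    if not alignment:
        return 0.0, 0.0, 0.0
    matched_o1 = {e1 for e1, _ in alignment}
    matched_o2 = {e2 for _, e2 in alignment}
    match_cover = (len(matched_o1) + len(matched_o2)) / (size_o1 + size_o2)
    mean_similarity = sum(alignment.values()) / len(alignment)
    total = match_cover + mean_similarity
    match_f = 2 * match_cover * mean_similarity / total if total > 0 else 0.0
    return match_cover, mean_similarity, match_f
```

The second function needs no reference alignment, which is the point of the approximation: it can be evaluated inside the optimization loop, while the first is reserved for the final assessment.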

Similarity Measures.
Typically, similarity measures can be classified into terminology-based, semantic-based, and structure-based measures.

Terminology-Based Measures.
Terminology-based measures calculate the string distance between the entity identifiers, labels, and comments of two ontologies. There are various terminology-based measures, for instance, Levenshtein distance [23] and Jaro-Winkler distance [24]. In this work, a widely used terminology-based measure, the Levenshtein distance, is employed. Given two strings s 1 and s 2 , the Levenshtein distance can be defined by the following equation: where |s 1 | and |s 2 | are the lengths of s 1 and s 2 , respectively, and d(s 1 , s 2 ) is the number of operations necessary to transform s 1 into s 2 .
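As an illustration of the terminology-based measure, the sketch below computes the edit distance d(s1, s2) with the standard dynamic program and turns it into a similarity in [0, 1]. The normalization used here, max(0, (min(|s1|, |s2|) − d) / min(|s1|, |s2|)), is an assumption chosen for the example and may differ from the exact form used in the paper; the function names are hypothetical.

```python
def levenshtein(s1, s2):
    """Number of insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalization: max(0, (min_len - d) / min_len)."""
    shortest = min(len(s1), len(s2))
    if shortest == 0:
        # Assumption: two empty strings are identical, otherwise no similarity.
        return 1.0 if len(s1) == len(s2) else 0.0
    return max(0.0, (shortest - levenshtein(s1, s2)) / shortest)
```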

Semantic-Based Measures.
The similarity value determined by semantic-based measures takes semantic information into account. For the entities' identifiers, we employ WordNet [25], an electronic lexical database that groups words into sets of synonyms, to calculate the distance based on linguistic relationships such as synonymy and hypernymy. Given two words $w_1$ and $w_2$, the linguistic distance $Sim_{Lin}(w_1, w_2)$ is defined as follows:
$$Sim_{Lin}(w_1, w_2) = \begin{cases} 1, & \text{if the two words are synonymous}, \\ 0.5, & \text{if one word is the hypernym of the other}, \\ 0, & \text{otherwise}. \end{cases}$$
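The three-valued linguistic measure can be sketched as a simple lookup. To keep the example self-contained, the two small tables below stand in for WordNet; they are hypothetical toy data, not WordNet content, and treating identical words as synonymous is an assumption of this sketch.

```python
# Hypothetical toy data standing in for WordNet synsets and hypernym links.
SYNONYMS = [{"car", "automobile"}, {"paper", "article"}]
HYPERNYMS = {"car": {"vehicle"}, "article": {"document"}}  # word -> direct hypernyms


def linguistic_similarity(w1, w2):
    """Return 1 for synonyms, 0.5 for a hypernym relation, 0 otherwise."""
    w1, w2 = w1.lower(), w2.lower()
    # Assumption: identical words count as synonymous.
    if w1 == w2 or any(w1 in group and w2 in group for group in SYNONYMS):
        return 1.0
    if w2 in HYPERNYMS.get(w1, ()) or w1 in HYPERNYMS.get(w2, ()):
        return 0.5
    return 0.0
```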
Another semantic-based similarity measure used is the cosine distance, which is expressed through the dot product and the magnitudes of two vectors [26]. Machine learning has a wide range of application scenarios [27][28][29][30]; in this study, we employ the pretrained Word2Vec model (https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/), which was trained on a Google News dataset of roughly 100 billion words. Given two entity vectors $V_1$ and $V_2$, the cosine distance is defined as follows:
$$\cos(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}.$$

Structure-Based Measures.
In our study, the structure-based measure follows the idea that "elements are similar in two different ontologies if they are related to similar elements." In particular, the structure-based distance is computed with the well-known similarity flooding (SF) algorithm [31], in which an iterative fixpoint computation (see equation (7)) is used to generate the correspondences between the elements of two ontologies:
$$\delta^{i+1} = \mathrm{normalize}\big(\delta^{i} + f(\delta^{i})\big), \tag{7}$$
where $\delta^{i}$ is the similarity value obtained in the previous iteration, which is updated in each iteration, and $f$ is a function that increments the similarity value of an element pair according to the similarity of its adjacent elements. For more information on the SF algorithm, please see [31].
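The cosine measure between two entity vectors can be written in a few lines. This is a minimal sketch on plain Python lists; in practice the vectors would come from the pretrained embedding model, and returning 0 for a zero vector is an assumption made here to avoid a division by zero.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their magnitudes."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)
```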

Aggregation Strategy.
A straightforward but practical aggregation strategy is used in this study: a weighted average that integrates all the similarity measures mentioned above. The method can be described as follows:
$$Sim(e_1, e_2) = \sum_{i=1}^{n} w_i \times sim_i(e_1, e_2),$$
where $e_1$ and $e_2$ are entities from different ontologies; $n$ is the number of similarity measures considered; $w_i$ is the weight of the $i$th similarity measure; and $sim_i(e_1, e_2)$ is the value of the $i$th similarity measure described above.
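A minimal sketch of the weighted aggregation, followed by the threshold filter that the optimization model applies to obtain the final alignment, is given here. The data layout and function names are hypothetical, keeping pairs whose score is at least the threshold is an assumption, and the sketch does not enforce a one-to-one alignment.

```python
def aggregate(similarities, weights):
    """Weighted sum of the individual similarity values of one entity pair."""
    return sum(w * s for w, s in zip(weights, similarities))


def extract_alignment(candidates, weights, threshold):
    """Keep the pairs whose aggregated similarity reaches the threshold.

    candidates maps (e1, e2) -> [sim_1, ..., sim_n], one value per measure.
    """
    alignment = {}
    for pair, sims in candidates.items():
        score = aggregate(sims, weights)
        if score >= threshold:  # assumption: the threshold is inclusive
            alignment[pair] = score
    return alignment
```

Because the individual similarity values can be computed once and stored, each candidate weight vector only costs one pass of this weighted sum over the stored values.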

iNSGA-II for Optimizing Ontology Alignment
In this study, we apply iNSGA-II to tackle the ontology matching problem. As a flexible and robust approach, NSGA-II can quickly find various nondominated solutions. To further improve the probability of true convergence and enhance the quality of the solutions, we introduce MKE as the local search algorithm because of its exploitation and exploration characteristics. The encoding mechanism, genetic operators, and local search algorithm are described in detail in the following sections.

Encoding Mechanism.
At the beginning of iNSGA-II, a random parent population is generated through decimal encoding. The chromosome of an individual can be split into two parts, one representing the weights and the other representing the threshold. Supposing that $p$ is the number of weights required, the cuts can be expressed as $c' = \{c'_1, c'_2, \ldots, c'_{p-1}\}$. The chromosome is decoded by sorting $c'$ in ascending order to obtain $c = \{c_1, c_2, \ldots, c_{p-1}\}$ and then computing the weights as follows:
$$w_k = \begin{cases} c_1, & k = 1, \\ c_k - c_{k-1}, & 1 < k < p, \\ 1 - c_{p-1}, & k = p. \end{cases}$$
Consequently, the chromosome length is (n−1)·cutLength + thresholdLength, where n is the number of required weights and cutLength and thresholdLength are the chromosome lengths of a cut and of the threshold, respectively. Figure 1 demonstrates the encoding and decoding mechanisms of the weights.
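The cut-based decoding can be sketched as follows: p − 1 cut points in [0, 1] are sorted, and the gaps between consecutive cuts (including the borders 0 and 1) become the p weights, so the weights are nonnegative and sum to one by construction. The function name is hypothetical.

```python
def decode_weights(cuts):
    """Turn p-1 unordered cut points in [0, 1] into p weights summing to 1."""
    bounds = [0.0] + sorted(cuts) + [1.0]
    # Each weight is the gap between two consecutive boundaries.
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]
```

For example, the cuts 0.7, 0.2, and 0.5 are sorted to 0.2, 0.5, 0.7 and yield four weights of approximately 0.2, 0.3, 0.2, and 0.3.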

Genetic Operators.
The iNSGA-II utilizes the following genetic operators to generate an offspring population.

Selection.
The target of the selection operator is to select the two parents used by the crossover operator. In this study, two individuals are selected randomly from the parent population as parent 1 and parent 2 , respectively.

Crossover.
Using the selection operator, we select two chromosomes from the population as parents, i.e., parent 1 and parent 2 . Whether the crossover operator is applied is decided on the basis of the crossover probability, which is a parameter of the algorithm. Then, two children are generated according to the following formula: where child i is the ith generated individual and rand i is a random number in the interval [0, 1].
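One plausible reading of this operator is an arithmetic crossover in which each child is a random convex combination of the two parents. The sketch below assumes the form child_i = rand_i · parent_1 + (1 − rand_i) · parent_2; this form, the function name, and the choice to return copies of the parents when crossover is not applied are assumptions for illustration, not the paper's confirmed formula.

```python
import random


def crossover(parent1, parent2, crossover_prob, rng):
    """Assumed arithmetic crossover: child_i = rand_i*p1 + (1 - rand_i)*p2."""
    if rng.random() >= crossover_prob:
        return list(parent1), list(parent2)  # crossover not applied
    children = []
    for _ in range(2):
        r = rng.random()  # one random number per child
        children.append([r * a + (1 - r) * b for a, b in zip(parent1, parent2)])
    return children[0], children[1]
```

With this form, every gene of a child lies between the corresponding genes of its parents, so children of valid chromosomes stay inside [0, 1].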

Mutation.
The mutation operator is used to prevent premature convergence and to preserve the diversity of the population. Depending on the mutation probability, a new individual is produced according to the following formula: where rand is a random number in the interval [0, 1] and r is a random number that determines whether Indi old should increase or decrease. Furthermore, the formula ensures that the newly produced individual stays in the interval [0, 1].
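A mutation with the properties described above (a random step size, a random direction, and a result that stays in [0, 1]) could look like the sketch below. The update rules, gene + rand · (1 − gene) for an increase and gene − rand · gene for a decrease, as well as the 0.5 split on r, are assumptions chosen to satisfy those properties and are not taken from the paper.

```python
import random


def mutate(individual, mutation_prob, rng):
    """Assumed mutation: move each selected gene toward 1 or toward 0."""
    mutated = []
    for gene in individual:
        if rng.random() < mutation_prob:
            rand = rng.random()     # step size in [0, 1)
            if rng.random() < 0.5:  # r decides the direction (assumed split)
                gene = gene + rand * (1.0 - gene)  # increase, never above 1
            else:
                gene = gene - rand * gene          # decrease, never below 0
        mutated.append(gene)
    return mutated
```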

Generation of New Population.
To speed up the algorithm's convergence and preserve the diversity of the population, the new parent population is generated by merging the parent and offspring populations, sorting the merged population by nondomination rank and crowding distance, and keeping the best half of the chromosomes. More details can be found in [1].
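The environmental selection step can be sketched as follows for maximized objectives. This is a compact, unoptimized illustration of nondominated sorting and crowding distance, not the reference NSGA-II implementation; the function names are hypothetical, and each solution is represented only by its tuple of objective values.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def nondominated_fronts(objs):
    """Split solution indices into fronts of increasing nondomination rank."""
    remaining, fronts = set(range(len(objs))), []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """Larger values mean a less crowded region; extremes get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        low, high = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, len(ordered) - 1):
            gap = objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (high - low)
    return dist


def select_next_population(objs, size):
    """Fill the next population front by front, breaking ties by crowding."""
    chosen = []
    for front in nondominated_fronts(objs):
        if len(chosen) + len(front) <= size:
            chosen.extend(front)
            continue
        dist = crowding_distance(objs, front)
        ranked = sorted(front, key=lambda i: dist[i], reverse=True)
        chosen.extend(ranked[:size - len(chosen)])
        break
    return chosen
```

Calling select_next_population on the merged parent and offspring objectives with size equal to half of the merged population corresponds to the step described above.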

Local Search Algorithm.
After the new parent population is generated, to efficiently determine the knee solutions of the PF, we apply MKE as a local perturbation to each knee solution in the Pareto front when that knee solution has not changed for five generations. In this study, a knee solution is a solution in the PF with the best recall or the best precision. In particular, the knee position, which is a latent location on the Pareto front, presents the greatest trade-off between the objectives. For each knee solution, i.e., the one with the best recall and the one with the best precision, the local perturbation is carried out with the second version of MKE [14]. The design of MKE is inspired by the Monkey King, a character in the Chinese classic Journey to the West, who plays a key role in protecting his master and solving various problems during the journey. Equation (12) defines $X$ and $\mathbf{X}_{knee,G}$, where $X$ is the matrix representing the population, the $i$th individual $X_i$ is the $i$th row vector of $X$, $s$ is the population size, $X_{knee,G}$ is the knee solution in the population, and $\mathbf{X}_{knee,G}$ denotes the knee solution matrix composed of $C$ copies of this row vector, i.e., a $C \times D$ matrix:
$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_s \end{bmatrix}, \qquad \mathbf{X}_{knee,G} = \begin{bmatrix} X_{knee,G} \\ X_{knee,G} \\ \vdots \\ X_{knee,G} \end{bmatrix}_{C \times D}. \tag{12}$$
The formula for updating the knee solution is defined as follows:
$$X_{diff} = X_{r1} - X_{r2}, \qquad X_{mk} = \mathbf{X}_{knee,G} + FC \times X_{diff}, \qquad X_{knee,G+1} = \mathrm{opt}\{X_{mk}(i), X_{knee,G}\},$$
where $C$ and $D$ are the number of rows of the knee solution matrix and the cardinality of dimensions, respectively; $X_{r1}$ and $X_{r2}$ are $C \times D$ matrices whose row vectors are chosen randomly from the population $X$; $X_{diff}$ is the difference matrix given by the difference of the two random matrices $X_{r1}$ and $X_{r2}$; $FC$ is the fluctuation coefficient of the difference matrix; $X_{mk}(i)$ is the $i$th row vector of $X_{mk}$, $i = 1, 2, \ldots, C$; and $X_{knee,G+1}$ is the optimum row vector among $X_{mk}(i)$ and $X_{knee,G}$, i.e., the individual with the highest fitness. The flow of determining the alignments is shown in Figure 2.
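The local perturbation around a knee solution can be sketched with plain lists. This is a simplified illustration under stated assumptions, not the authors' implementation: each of the C candidates is the knee solution plus FC times the difference of two randomly chosen population members, the best of the candidates and the current knee solution is kept, and clamping genes to [0, 1] as well as the default values of c and fc are choices made only for this example.

```python
import random


def mke_local_search(knee, population, fitness, c=4, fc=0.5, rng=None):
    """Perturb the knee solution c times and keep the fittest vector found.

    fitness is a caller-supplied function to be maximized; the population
    must contain at least two individuals.
    """
    rng = rng or random.Random()
    best, best_fit = list(knee), fitness(knee)
    for _ in range(c):
        r1, r2 = rng.sample(population, 2)  # two distinct random individuals
        candidate = [
            min(1.0, max(0.0, k + fc * (a - b)))  # clamp: assumption
            for k, a, b in zip(knee, r1, r2)
        ]
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return best
```

Because the current knee solution is always among the compared vectors, the returned solution is never worse than the one the search started from.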

Experiment
In this study, the experiment is conducted on the well-known benchmark track provided by the ontology alignment evaluation initiative (OAEI) (http://oaei.ontologymatching.org/2016/). In the benchmark track, each testing case comprises two to-be-mapped ontologies and a reference alignment, which is used to assess the effectiveness of the ontology matcher. Table 1 provides a concise description of OAEI's benchmark track.

Experimental Configuration.
In this study, we conduct the experiment with the four similarity measures mentioned above, i.e., Levenshtein distance, linguistic distance, cosine distance, and structure distance. The similarity measures are applied to both input ontologies, and the results are saved in a CSV file before iNSGA-II is executed. This avoids recalculating the similarities and improves the efficiency of the matching process. The proposed iNSGA-II utilizes the following parameters, which are trade-off settings determined empirically to achieve the highest average alignment quality across all testing cases in the dataset used. With this parameter configuration, the experiment indicates that the selected parameters are robust to all the heterogeneity issues between the two ontologies in the benchmark, and they are expected to be robust for the general heterogeneity situations found in the real world.

Experimental Results.
To compare the quality of our approach with the NSGA-II-based ontology matcher and OAEI's participants (http://oaei.ontologymatching.org/2016/results/benchmarks/index.html), the obtained alignments are assessed with the conventional recall, precision, and f-measure. The results of NSGA-II and iNSGA-II are mean values over thirty independent runs, and we choose the individual with the best f-measure among the knee solutions as the representative. Table 2 displays the mean values and standard deviations of the three metrics obtained by our proposal and NSGA-II. Table 3 shows the t-test statistical analysis [32] of the values in Table 2. It can be seen from the tables that, in all testing cases, the quality of the alignments obtained by iNSGA-II is better than or equal to that of NSGA-II, owing to the exploitation and exploration characteristics of MKE.
In Figure 3, the X-axis represents the different ontology matchers and the Y-axis shows the values of f-measure, precision, and recall. Because iNSGA-II takes both recall and precision into consideration and exploits the local search, it can be seen from Figure 3 that the f-measure of our proposal is 0.92, which outperforms all of the OAEI's participants and NSGA-II. Figure 4 shows that our proposal is more efficient than NSGA-II in terms of the number of evaluations needed for the f-measure to reach 0.6.
From the above, the quality of the alignments of iNSGA-II is better than that of NSGA-II and OAEI's participants, which demonstrates its better trade-off between exploration and exploitation. To conclude, iNSGA-II can efficiently determine high-quality ontology alignments when matching various heterogeneous ontologies.

Conclusion
Ontology alignment finds the heterogeneous entities of diverse ontologies, which can speed up translation discovery and knowledge integration. In this study, iNSGA-II, which incorporates the monkey king evolution algorithm (MKE) into NSGA-II's evolutionary process as the LS algorithm, is introduced to match heterogeneous entities; it effectively determines the knee solution of the PF and thereby enhances the quality of the alignments. On the benchmark track of OAEI, the experimental results show that our proposal outperforms the NSGA-II-based ontology matching technique and the OAEI's participants. To further enhance the quality of alignments, we will focus in the future on semiautomatic ontology matching techniques that introduce an interactive process into the MOEA-based matcher, and we are also interested in word embedding-based similarity measures, which take the background knowledge and entity information of ontologies into consideration to improve the performance of ontology matching techniques [33].

Data Availability
The data used to support this study can be found at http://oaei.ontologymatching.org.

Conflicts of Interest
The authors declare that they have no conflicts of interest related to this work.