Heterogeneous Information Network-Based Scientific Workflow Recommendation for Complex Applications

Scientific workflow is a valuable tool for various complicated large-scale data processing applications. In recent years, the increasingly growing number of scientific processes available necessitates the development of recommendation techniques to provide automatic support for modelling scientific workflows. In this paper, with the help of heterogeneous information network (HIN) and tags of scientific workflows, we organize scientific workflows as a HIN and propose a novel scientific workflow similarity computation method based on metapath. In addition, the density peak clustering (DPC) algorithm is introduced into the recommendation process and a scientific workflow recommendation approach named HDSWR is proposed. The effectiveness and efficiency of our approach are evaluated by extensive experiments with real-world scientific workflows.


Introduction
Scientific workflow is an effective and important means to deal with data-intensive, computation-intensive, and collaboration-intensive scientific issues in many large-scale complex systems or applications from domains such as physics, astronomy, chemistry, bioinformatics, and life sciences [1][2][3]. In practice, many scientific workflows have been successfully deployed and executed on clouds. Recently, with the quick development of smart user devices and edge computing, a number of studies have been carried out to construct and execute workflows in a cloud-edge collaborative manner [4,5].
Scientific workflow modelling plays an important role in complex scientific workflow applications, but it is a complex and error-prone process. In recent years, more and more scientific workflows have been published on the Web and shared in repositories such as CrowdLabs, SHIWA, Galaxy, and myExperiment [6,7]. People can leverage and repurpose parts of existing scientific workflows for specific complex applications, rather than constructing new ones from scratch. However, as the number of scientific workflows grows, finding suitable scientific workflows in a sea of candidates becomes a new problem for scientists and engineering personnel. Though process retrieval methods can help to handle this problem by retrieving similar scientific workflow fragments from repositories, much manual work is still required. Consequently, to provide better automatic support, it is necessary to build effective scientific workflow recommendation techniques, which are fundamental for the reuse and repurposing of existing scientific workflows.
In scientific workflow repositories, various types of data can be used for recommendation, including scientific workflow structure and annotation. However, the tags of scientific workflows are usually neglected by existing scientific workflow recommendation methods. In fact, the tags of scientific workflows contain much valuable information, and different underlying logical relations among scientific workflows can be explored through them. For example, many tags in the myExperiment repository are shared by multiple scientific workflows, and there exist partial similarity relations among these scientific workflows. Therefore, integrating tags and other information of scientific workflows is promising for generating more accurate recommendations.
On the other hand, heterogeneous information network (HIN) has been proved to be a powerful modelling method to incorporate various heterogeneous types of information and it has been successfully applied in recommender systems [8,9]. Motivated by the HIN-based recommendation idea and data characteristics of scientific workflow repository, we plan to integrate multiple types of scientific workflow data into the form of HIN and use a metapath-based technique to measure similarity and calculate distance between scientific workflows, by which multiple metapaths can be combined with the semantic description information of scientific workflows and more accurate similarity computation results would be obtained.
With these observations, in this paper, we propose a heterogeneous information network-based approach for recommending scientific workflows to scientists and engineering personnel. In our approach, different data objects and underlying logical relations on scientific workflows are organized as a HIN, according to which the similarity between scientific workflows is evaluated. In addition, to facilitate the reuse and repurposing of existing scientific workflows, the density peak clustering (DPC) algorithm [10] is introduced and used to group candidates into clusters. Our main contributions are summarized as follows:
(1) We propose a new HIN-based representation of scientific workflows, which is enriched by incorporating multiple types of data, including tags, and the logical relations among such data.
(2) We build a metapath-based method to assess the similarity between scientific workflows, where the similarity is calculated according to the tag, description, activity, and subscientific workflow objects involved in scientific workflows.
(3) We present a HIN- and DPC-based scientific workflow recommendation approach named HDSWR to generate more accurate recommendations and, on that basis, to facilitate the reuse and repurposing of existing scientific workflows for scientists and engineering personnel.
(4) We provide two real-world datasets with tags on scientific workflows for experiments.
The remainder of this paper is organized as follows. Section 2 describes the related studies. Section 3 introduces some notations and basic definitions used in the paper. Section 4 presents the scientific workflow similarity computation method. In Section 5, we propose the HDSWR approach. Then, we evaluate our method in Section 6. Section 7 concludes this paper.

Related Work
In this section, we briefly review related work on the workflow models, workflow recommendations, and HIN.
A workflow model is fundamental for various workflow applications. In practice, workflows can be modelled by different tools such as directed acyclic graphs (DAGs), Petri nets, event-driven process chains (EPCs), the business process execution language (BPEL), or the fairly complex business process modelling notation (BPMN) language which has over 100 symbols [11]. However, modelling workflows is always a knowledge-intensive and laborious task. To improve workflow modelling, methods such as workflow mining [12] have been proposed to discover workflow models from event logs. However, similar to process retrieval, much manual work is still involved.
In recent years, some workflow recommendation approaches have been proposed. Current techniques can be mainly classified into two types: business workflow (process) recommendation and scientific workflow recommendation.
In the business process management domain, business workflow is usually modelled with block structures including sequential structures, alternative structures, parallel structures, and iterative structures. So far, only a limited number of business workflow recommendation methods have been proposed to serve different purposes, which can be classified into complete process recommendation and process fragment (node) recommendation [13]. For example, Zhang et al. [14] leveraged workflow provenance to recommend a set of nodes for a partial workflow. Li et al. [15] adopted minimum depth-first-search codes and string edit distances for representing and recommending business workflow fragments. Deng et al. [13] developed a recommendation system to generate a sorted set of candidate nodes, which used a subgraph mining method to extract patterns from process repositories. Wang et al. [16] utilized the properties of business process repositories and proposed a representation-learning-based recommendation method.
Scientific workflows are based on the automation of scientific processes, which are typically composed of multiple scientific programs or Web services. Compared with business workflows, scientific workflows have a strong focus on the dataflow to sufficiently support a variety of data-intensive applications, in which the control structure simply describes the partial ordering of tasks. Therefore, scientific workflows are usually modelled with unstructured DAGs, which conceptually use a set of nodes and edges instead of complex block structures. Similar to business workflow recommendation, there are two kinds of work in scientific workflow recommendation. For instance, Zhang et al. [17] used the term unit of work (UoW) to represent a collection of services (i.e., fragments of a scientific workflow) chained together, based on which a UoW-driven scientific workflow recommendation framework and three algorithms for UoW mining and recommendation were proposed. Cheng et al. [18,19] converted a scientific workflow into a layer hierarchy in a tree style, where the hierarchical relations specify the links between a scientific workflow, its subworkflows, and activities. Based on it, a semantic similarity computation algorithm considering the layer hierarchy and description of scientific workflows was proposed for clustering and recommending appropriate scientific workflows. Krzywucki and Polak [20] utilized semantic-type comparison to evaluate the similarity of scientific workflows. Bergmann et al. [21] proposed a semantic workflow graph-based method for modelling scientific workflow similarity and developed an A*-search-based algorithm for workflow similarity computation. Starlinger et al. [7] presented a layer decomposition approach for the comparison and similarity search of scientific workflows. Mohan et al. [22] developed several folksonomy-based scientific workflow recommendation strategies and implemented them in a prototype system.
HIN is a newly emerging direction in recommender systems and a good candidate for improving the accuracy of recommendations. However, to the best of our knowledge, HIN is normally neglected in the workflow recommendation literature. So far, most HIN-based recommendation methods consider metapath-based similarity. For example, Sun et al. [8] investigated a similarity search problem in HIN and introduced the concept of metapath-based similarity. Zhao et al. [23] introduced the concept of metagraph to incorporate more complex semantics for HIN-based recommendation. Shi et al. [24] developed a metapath-based random walk strategy and proposed a HIN embedding-based recommendation algorithm. On the other hand, scientific workflows in repositories have rich tag information, which is seldom exploited by existing workflow recommendation methods. Some research work related to tags has been done in the domain of Service Computing [25], and other related research work on service recommendation was carried out in [26]. Our previous work in [27] preliminarily utilized scientific workflow tags for recommendation. In this paper, we further organize scientific workflows and their relations as a HIN to calculate the similarity of scientific workflows and generate more accurate recommendations.

Preliminaries
To make our approach well understood, we first introduce HIN and relevant concepts in this section. The notations we will use throughout this paper are summarized in Table 1.
Definition 1 (Scientific Workflow [18]). A scientific workflow sw is a tuple (nm, sw_dsc, sw_D, sw_A, sw_L, sw_T), where nm and sw_dsc are the name and text description of sw, respectively. sw_D is the set of subscientific workflows that sw invokes. sw_A is the activity set of sw. sw_L denotes a set of links connecting activities and subscientific workflows in sw. sw_T is a set of tags on sw.
Generally, a subscientific workflow can be regarded as a scientific workflow [7]. For example, in the myExperiment repository, a subscientific workflow is stored as an independent scientific workflow.
Definition 2 (Heterogeneous Information Network [24,28]). A heterogeneous information network is defined as a directed graph G = (V, E) with an object-type mapping function ϕ: V ⟶ B and a link-type mapping function ψ: E ⟶ R, satisfying |B| + |R| > 2.
The scientific workflow can be organized and represented as a heterogeneous information network, which contains five object types: scientific workflow (denoted as SW), tag (denoted as T), activity (denoted as A), subscientific workflow (denoted as D), and description (denoted as dsc). Each scientific workflow can link with a set of tags, a set of activities, a set of subscientific workflows, and a description.
Besides, $sw_1$ and $sw_2$ are linked by two tags (cheminformatics and chemspider), which are shared by $sw_1$ and $sw_2$. Similarly, if some subscientific workflow, activity, or description objects are shared by two scientific workflows, there exists a link relation between these two scientific workflows.
Definition 4 (Network Schema [24,28]). The network schema is a meta template for a heterogeneous information network G = (V, E) with the object-type mapping function ϕ: V ⟶ B and the link-type mapping function ψ: E ⟶ R, which is a directed graph S = (B, R) defined over object types B and link types R.
According to Definition 4, we can construct a HIN-based scientific workflow representation schema, which is shown in Figure 2.
There are five types of objects: scientific workflow (SW), tag (T), activity (A), subscientific workflow (D), and description (dsc). Besides, there exist four types of links between objects to represent different relations: (1) A link relation between a scientific workflow and a tag. (2) A link relation between a scientific workflow and an activity. (3) A link relation between a scientific workflow and a subscientific workflow. (4) A link relation between a scientific workflow and a description. This last link relation is one-way, because a specific text description belongs to a specific scientific workflow.
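The schema above can be sketched as a small typed graph. This is only an illustration under stated assumptions: the class `WorkflowHIN` and its methods are hypothetical names, and the assignment of the tags annotation and metabolomics to `sw1` and `sw2` is assumed, since the text only states that cheminformatics and chemspider are shared.

```python
from collections import defaultdict

# Link types of the schema: a scientific workflow (SW) links to a tag (T),
# an activity (A), a subscientific workflow (D), or a description (dsc).
LINK_TYPES = {("SW", "T"), ("SW", "A"), ("SW", "D"), ("SW", "dsc")}


class WorkflowHIN:
    """Toy typed graph: nodes are (type, name) pairs (hypothetical helper)."""

    def __init__(self):
        self.neighbors = defaultdict(set)

    def add_link(self, workflow, obj_type, obj_name):
        if ("SW", obj_type) not in LINK_TYPES:
            raise ValueError(f"unsupported link type SW-{obj_type}")
        sw_node, obj_node = ("SW", workflow), (obj_type, obj_name)
        self.neighbors[sw_node].add(obj_node)
        # The description link is one-way; the other links can be walked back.
        if obj_type != "dsc":
            self.neighbors[obj_node].add(sw_node)

    def shared(self, sw_a, sw_b, obj_type):
        """Names of the objects of one type linked to both workflows."""
        a = {n for n in self.neighbors[("SW", sw_a)] if n[0] == obj_type}
        b = {n for n in self.neighbors[("SW", sw_b)] if n[0] == obj_type}
        return {name for _, name in a & b}


hin = WorkflowHIN()
# Assumed tag assignment: only the two shared tags are given in the text.
for tag in ("annotation", "cheminformatics", "chemspider"):
    hin.add_link("sw1", "T", tag)
for tag in ("cheminformatics", "chemspider", "metabolomics"):
    hin.add_link("sw2", "T", tag)

print(sorted(hin.shared("sw1", "sw2", "T")))  # ['cheminformatics', 'chemspider']
```

Two workflows that share tag objects are connected through those tags, which is the relation the metapath SW-T-SW later exploits.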
Definition 5 (Metapath [8,24]). A metapath p is a path defined on a network schema S = (B, R) and is represented in the form of $B_1 \rightarrow B_2 \rightarrow \cdots \rightarrow B_{l+1}$, and thus defines a composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between two object types $B_1$ and $B_{l+1}$, where $\circ$ denotes the composition operator on relations.

Complexity
According to Definition 5 and the HIN-based scientific workflow representation schema, we can construct four types of metapaths, which are shown in Figure 3.

Table 1 (partial): Notations.
A: an activity type of object
T: a tag type of object
$p_1$, $p_2$, $p_3$, $p_4$: different types of metapaths
SWT, SWA, SWD: adjacency matrices on the objects of tag, activity, and subscientific workflow, respectively
$v_i^T$, $v_i^A$, $v_i^D$: feature vectors of scientific workflow $sw_i$ in the adjacency matrices SWT, SWA, and SWD, respectively

Similarity Computation for Scientific Workflows
Based on the basic definitions mentioned above, we propose a novel scientific workflow similarity computation method in this section. It mainly consists of four steps.
Step 1: construct three adjacency matrices on the objects of tag, activity, and subscientific workflow.
According to the tag, activity, and subscientific workflow objects involved in the scientific workflows, we can construct three adjacency matrices, denoted as SWT, SWA, and SWD, respectively. Each row of an adjacency matrix corresponds to a specific scientific workflow. Each column of SWT, SWA, and SWD corresponds to a specific tag, activity, and subscientific workflow object, respectively. The values in these three adjacency matrices are 1 or 0, denoting whether or not a specific object belongs to a specific scientific workflow. Besides, for computational convenience, we use the feature vector $v_i^T$ to represent the relation between the scientific workflow $sw_i$ and all the tag objects involved, which corresponds to a row in the adjacency matrix SWT. Likewise, we use the feature vectors $v_i^A$ and $v_i^D$ to represent the relations between the scientific workflow $sw_i$ and the activity and subscientific workflow objects involved, which correspond to a row in the adjacency matrices SWA and SWD, respectively.
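As a minimal sketch of Step 1, the following builds a 0/1 adjacency matrix from sets of objects. The function name `build_adjacency` and the toy tag assignment are assumptions made for illustration; the same helper would apply to activities (SWA) and subscientific workflows (SWD).

```python
def build_adjacency(workflow_objects):
    """Return (column labels, {workflow: 0/1 row}) for one object type.

    workflow_objects maps a workflow name to the set of objects (for example
    tags) linked to it. Columns are the sorted union of all objects.
    """
    columns = sorted({obj for objs in workflow_objects.values() for obj in objs})
    rows = {
        sw: [1 if col in objs else 0 for col in columns]
        for sw, objs in workflow_objects.items()
    }
    return columns, rows


# Assumed toy data: which workflow carries which non-shared tag is a guess.
tags = {
    "sw1": {"annotation", "cheminformatics", "chemspider"},
    "sw2": {"cheminformatics", "chemspider", "metabolomics"},
}
columns, SWT = build_adjacency(tags)
print(columns)     # ['annotation', 'cheminformatics', 'chemspider', 'metabolomics']
print(SWT["sw1"])  # [1, 1, 1, 0]
print(SWT["sw2"])  # [0, 1, 1, 1]
```

Each row is the feature vector of one workflow, so the inner product of two rows counts the objects the two workflows share.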
Step 2: calculate the similarity on the metapaths. As mentioned in Section 3, there exist four types of metapaths. The similarity strength of $sw_i$ and $sw_j$ on metapath $p_1$ can be calculated by the following equation:

$$C^{p_1}_{i,j} = v_i^T \cdot (v_j^T)^t \tag{1}$$

In equation (1), $v_i^T$ and $v_j^T$ are the feature vectors of scientific workflows $sw_i$ and $sw_j$ on tags, respectively, and $(v_j^T)^t$ is the transpose of the feature vector $v_j^T$. The more tags $sw_i$ and $sw_j$ have in common, the greater the inner product of the two vectors and thus the greater the similarity between $sw_i$ and $sw_j$ on tags. Likewise, the similarity strength of $sw_i$ and $sw_j$ on metapaths $p_2$ and $p_3$ can be obtained by equations (2) and (3), respectively, where the meaning of the notations is similar to that in equation (1):

$$C^{p_2}_{i,j} = v_i^A \cdot (v_j^A)^t \tag{2}$$

$$C^{p_3}_{i,j} = v_i^D \cdot (v_j^D)^t \tag{3}$$

Based on equation (1), we can also obtain the values of $C^{p_1}_{i,i}$ and $C^{p_1}_{j,j}$, and the similarity between scientific workflows $sw_i$ and $sw_j$ with respect to metapath $p_1$ is described as follows:

$$sim_{p_1}(i,j) = \frac{2\,C^{p_1}_{i,j}}{C^{p_1}_{i,i} + C^{p_1}_{j,j}} \tag{4}$$

Analogously, the similarity between scientific workflows $sw_i$ and $sw_j$ with respect to metapaths $p_2$ and $p_3$ is described as follows:

$$sim_{p_k}(i,j) = \frac{2\,C^{p_k}_{i,j}}{C^{p_k}_{i,i} + C^{p_k}_{j,j}}, \quad k \in \{2, 3\} \tag{5}$$

Step 3: calculate the similarity value on the descriptions of scientific workflows. The doc2vec model can learn fixed-length features from variable-length text [29]. Therefore, we utilize the doc2vec model to form the paragraph vectors $v_{sw_i}$ and $v_{sw_j}$ for the descriptions of scientific workflows $sw_i$ and $sw_j$, respectively. The normalized cosine similarity between $v_{sw_i}$ and $v_{sw_j}$ is then taken as the similarity value on the descriptions of $sw_i$ and $sw_j$, which is described as:

$$sim_{p_4}(i,j) = \frac{v_{sw_i} \cdot v_{sw_j}}{\|v_{sw_i}\| \, \|v_{sw_j}\|} \tag{6}$$

In equation (6), the notations $\|v_{sw_i}\|$ and $\|v_{sw_j}\|$ represent the norms of the paragraph vectors $v_{sw_i}$ and $v_{sw_j}$, respectively.
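Steps 2 and 3 can be sketched in a few lines. The inner-product strength follows the text directly; the normalisation by the two self-strengths is an assumed PathSim-style form, and the cosine function takes plain lists of floats in place of real doc2vec paragraph vectors. All function names are hypothetical.

```python
import math


def strength(u, v):
    """Inner product of two 0/1 feature vectors: the number of shared objects."""
    return sum(a * b for a, b in zip(u, v))


def metapath_similarity(u, v):
    """Assumed PathSim-style normalisation: 2 * C_ij / (C_ii + C_jj)."""
    denom = strength(u, u) + strength(v, v)
    return 2 * strength(u, v) / denom if denom else 0.0


def cosine_similarity(x, y):
    """Cosine of the angle between two dense vectors (0.0 for a zero vector)."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / norm if norm else 0.0


# Toy tag vectors over four tags; the two workflows share two of them.
v_i = [1, 1, 1, 0]
v_j = [0, 1, 1, 1]
print(strength(v_i, v_j))             # 2
print(metapath_similarity(v_i, v_j))  # 2*2 / (3 + 3)
```

Under this normalisation a workflow is maximally similar to itself, and two workflows with no shared objects score 0.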
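Step 4: to effectively fuse the different similarities of scientific workflows obtained by the above steps, we introduce a weighting mechanism, which is described as:

$$sim(i,j) = \alpha \, sim_{p_1}(i,j) + \beta \, sim_{p_2}(i,j) + \gamma \, sim_{p_3}(i,j) + \delta \, sim_{p_4}(i,j) \tag{7}$$

In equation (7), α, β, γ, and δ are the weight coefficients satisfying α + β + γ + δ = 1.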
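The weighted fusion can be illustrated as follows. The default weights are the values reported in the experiments (0.55, 0.2, 0.2, 0.05); pairing them with the tag, activity, subscientific workflow, and description similarities in that order is an assumption, as is the function name `fuse`.

```python
def fuse(sim_p1, sim_p2, sim_p3, sim_p4,
         alpha=0.55, beta=0.2, gamma=0.2, delta=0.05):
    """Weighted sum of the four per-metapath similarities.

    The weights must sum to 1 so that the fused value stays in the same
    range as its inputs.
    """
    if abs(alpha + beta + gamma + delta - 1.0) > 1e-9:
        raise ValueError("weight coefficients must sum to 1")
    return alpha * sim_p1 + beta * sim_p2 + gamma * sim_p3 + delta * sim_p4


# Toy inputs: strong tag overlap, weaker overlap on the other metapaths.
overall = fuse(sim_p1=0.8, sim_p2=0.5, sim_p3=0.0, sim_p4=0.6)
print(round(overall, 4))  # 0.57
```

With these weights the tag metapath dominates the fused value, which matches the emphasis the approach places on tag information.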

HDSWR Approach
To improve the accuracy and efficiency of scientific workflow recommendation, we propose an approach named HDSWR. In this section, we provide an overview of the HDSWR and introduce its related function algorithms in detail.

Overview of the HDSWR Approach.
The proposed HDSWR approach is shown in Algorithm 1, which consists of four steps:
Step 1 (line 1): we construct a matrix to denote the similarity values between scientific workflows in the list SW, which may come from a scientific workflow repository. All the scientific workflows in the list SW are organized as a HIN for similarity computation.
Step 2 (line 2): we adopt the density peak clustering (DPC) algorithm [10] to group all the scientific workflows in the list SW into multiple clusters, where the distances between scientific workflows are derived from the similarity values in the matrix and Clusters denotes the resulting set of clusters.
Step 3 (lines 3-4): according to the textual descriptions in the requirement of scientists and engineering personnel, i.e., requirement.dscs, we search for and choose appropriate activity and subscientific workflow objects in the list SW, where $D_{smp}$ and $A_{smp}$ denote a set of subscientific workflows and a set of activities, respectively. Then, a HIN-based sample scientific workflow $sw_{smp}$ can be constructed (line 4).
Step 4 (line 5): according to the sample scientific workflow $sw_{smp}$, we first select an appropriate group of scientific workflows in the set Clusters by the similarity values between $sw_{smp}$ and the different clusters. Then, a list $SW_{rec}$ is generated for recommendation, where the number of scientific workflows in $SW_{rec}$ is controlled by the parameter rec_K.

Similarity Computation.
Assessing workflow similarity is important for workflow recommendation. Its main purpose is to measure the distances between workflows. Based on the scientific workflow similarity computation method introduced in Section 4, the function ComputeSimilarity is described as Algorithm 2.
In Algorithm 2, three adjacency matrices on the scientific workflow list SW are constructed first (lines 1-3). Then, the feature vectors of scientific workflows $sw_i$ and $sw_j$ are used to compute the similarity strengths on metapaths by equations (1)-(3) (lines 6-7), based on which the similarity between $sw_i$ and $sw_j$ with respect to each metapath can be obtained by equations (4)-(6) (line 8). Finally, the similarity values are obtained by equation (7) (line 9) and stored in the matrix Matrix for further clustering and recommendation (line 10).

Example 2.
The scientific workflows $sw_1$ and $sw_2$ in Figure 1 can be used as an example. As illustrated by Figure 1, there are four tags (annotation, chemspider, cheminformatics, and metabolomics) involved in the scientific workflows $sw_1$ and $sw_2$. Therefore, as shown in Figure 4(a), the corresponding value on these four tags in the adjacency matrix SWT is 1 or 0 with respect to $sw_1$ and $sw_2$, where the value of 0 denotes that the tag does not belong to the scientific workflow. Similarly, the matrix SWA in Figure 4

DPC-Based Clustering of Scientific Workflows.
To improve the efficiency of recommendation, we introduce the clustering strategy proposed in [10,30], by which the scientific workflows are grouped and divided into different clusters for further recommendation. Different from the work in [10,30], we choose the density peak clustering (DPC) algorithm [10] as our clustering method, because it can effectively identify clusters with different distribution shapes and it is rarely affected by noise points. Based on the DPC algorithm, the function DPCClustering can be described as Algorithm 3.
In Algorithm 3, we first initialize the matrix dist according to the matrix Matrix (line 1) and initialize the value of the cutoff distance dc according to the rule of thumb introduced in [10] (line 2). Then, we calculate the local density values of scientific workflows (lines 3-10) and their relative distance values (lines 11-18). Finally, we apply the DPC algorithm to divide scientific workflows into different clusters (line 19), where each cluster in Clusters is a group of scientific workflows with one scientific workflow as its cluster center.
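The two DPC quantities and a simple cluster assignment can be sketched as below. This is an illustration under assumptions rather than the paper's implementation: the final assignment rule (centres chosen by density times relative distance, remaining points attached to the nearest labelled point) and the `n_clusters` parameter are assumed, and the toy distances come from points on a line instead of `1 - Matrix`.

```python
def local_density(dist, dc):
    """For each workflow, count the distances in its row that fall below dc."""
    return [sum(1 for d in row if d < dc) for row in dist]


def relative_distance(dist, density):
    """Smallest distance to any workflow with a strictly higher local density."""
    max_dist = max(max(row) for row in dist)
    result = []
    for i, row in enumerate(dist):
        higher = [row[j] for j in range(len(row)) if density[j] > density[i]]
        # The densest workflow has no denser neighbour and keeps the maximum.
        result.append(min(higher) if higher else max_dist)
    return result


def dpc_clusters(dist, dc, n_clusters):
    """Pick centres by density * relative distance, then attach the rest."""
    density = local_density(dist, dc)
    rd = relative_distance(dist, density)
    order = sorted(range(len(dist)), key=lambda i: density[i], reverse=True)
    centres = sorted(order, key=lambda i: density[i] * rd[i], reverse=True)[:n_clusters]
    label = {c: c for c in centres}
    for i in order:
        if i not in label:
            # Inherit the cluster of the closest workflow already labelled.
            nearest = min(label, key=lambda j: dist[i][j])
            label[i] = label[nearest]
    clusters = {c: [] for c in centres}
    for i in sorted(label):
        clusters[label[i]].append(i)
    return clusters


# Toy data: five points on a line forming two obvious groups.
positions = [0, 1, 2, 9, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
clusters = dpc_clusters(dist, dc=1.5, n_clusters=2)
print(clusters)  # {1: [0, 1, 2], 4: [3, 4]}
```

Cluster centres are points that are both dense and far from any denser point, which is why point 1 and point 4 are selected in the toy data.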

Retrieval of Appropriate Activities and Subscientific Workflows.
According to the modelling requirement of scientists and engineering personnel, we can search the scientific workflow list and get appropriate activities and subscientific workflows, which can be used to construct a sample scientific workflow and guide the recommendation process. This procedure is performed by the function GetActivity_SubWF, which is described as Algorithm 4.
ALGORITHM 2: Function ComputeSimilarity (SW, α, β, γ, δ).
(1) SWT ⟵ construct the adjacency matrix of SW on tag objects
(2) SWA ⟵ construct the adjacency matrix of SW on activity objects
(3) SWD ⟵ construct the adjacency matrix of SW on sub-scientific workflow objects
(4) for each scientific workflow sw_i in SW do
(5)   for each scientific workflow sw_j in SW do
(6)     obtain the feature vectors of sw_i and sw_j
(7)     calculate the similarity strengths of sw_i and sw_j by equations (1)-(3)
(8)     calculate sim_p1(i, j), sim_p2(i, j), sim_p3(i, j), sim_p4(i, j)
(9)     calculate sim(i, j)
(10)    Matrix_{i,j} ⟵ sim(i, j)
(11)  end for
(12) end for
(13) return Matrix

ALGORITHM 3: Function DPCClustering (Matrix, SW).
(1) dist ⟵ 1 − Matrix
(2) dc ⟵ select a value from dist so that the number of values below it is around 1 to 2% of the total number of values in dist
(3) for each scientific workflow sw_i in SW do
(4)   ld_i ⟵ 0
(5)   for each scientific workflow sw_j in SW do
(6)     if dist_{i,j} < dc then
(7)       ld_i ⟵ ld_i + 1
(8)     end if
(9)   end for
(10) end for
(11) for each scientific workflow sw_i do
(12)   rd_i ⟵ max(dist)
(13)   for each scientific workflow sw_j do
(14)     if ld_i < ld_j and rd_i > dist_{i,j} then
(15)       rd_i ⟵ dist_{i,j}
(16)     end if
(17)   end for
(18) end for
(19) Clusters ⟵ cluster the scientific workflows by the DPC algorithm with the local density values ld_i and the relative distance values rd_i
(20) return Clusters

In Algorithm 4, because the descriptions requirement.dscs provided in the requirement are related to activities or subscientific workflows, the best matching result for each description in requirement.dscs may be an activity or a subscientific workflow. Therefore, we calculate the similarity values on activities and subscientific workflows, respectively, where the working procedure of the function cosine_sim in lines 6 and 13 is similar to that of equation (6).
Besides, for each description in requirement.dscs, we search for the best matching activity (lines 5-11) and the best matching subscientific workflow (lines 12-18), and then choose the better of the two for constructing a sample scientific workflow (lines 20-24).
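The retrieval procedure can be sketched as follows. To keep the example self-contained, a bag-of-words count vector stands in for the doc2vec paragraph vector, which is a deliberate simplification; the activity names and all texts are invented for illustration, while CNTCI is the subscientific workflow named in the paper's example.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for doc2vec: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_sim(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def best_match(description, candidates):
    """Return (name, score) of the candidate text closest to the description."""
    best_name, best_score = None, 0.0
    for name, text in candidates.items():
        score = cosine_sim(embed(description), embed(text))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def get_activity_subwf(descriptions, activities, subworkflows):
    """For each description keep the better of the best activity and best sub-workflow."""
    a_smp, d_smp = [], []
    for dsc in descriptions:
        a_tmp, sim_a = best_match(dsc, activities)
        d_tmp, sim_d = best_match(dsc, subworkflows)
        if sim_a > sim_d:
            a_smp.append(a_tmp)
        elif d_tmp is not None:
            d_smp.append(d_tmp)
    return d_smp, a_smp


# Invented toy repository content.
activities = {
    "get_csid": "map chemical name to chemspider identifier",
    "parse_xml": "parse xml response",
}
subworkflows = {"CNTCI": "chemical name to chemspider id"}
descriptions = [
    "map a chemical name to a chemspider identifier",
    "chemical name to chemspider id lookup",
]
d_smp, a_smp = get_activity_subwf(descriptions, activities, subworkflows)
print(a_smp, d_smp)  # ['get_csid'] ['CNTCI']
```

Each requirement description contributes exactly one object to the sample workflow, either an activity or a subscientific workflow, whichever matches it more closely.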

Generation of Scientific Workflow Candidate List.
Once a sample scientific workflow is constructed, we can generate a list of the scientific workflows that are most relevant to it. The whole procedure is described as Algorithm 5.
Algorithm 5 mainly consists of three steps.
Step 1 (line 1): as introduced before, we construct the feature vectors of the sample scientific workflow $sw_{smp}$ on the objects of activity, subscientific workflow, and tag.
Step 2 (lines 3-14): we first compute the similarity between the sample scientific workflow $sw_{smp}$ and each cluster center scientific workflow (lines 4-8), following the method introduced in Section 4. Then, a cluster is selected as $cluster_{smp}$ if the similarity value between its cluster center and $sw_{smp}$ is the largest among all the clusters (lines 9-13).
Step 3 (line 15): after the cluster $cluster_{smp}$ is determined, the top rec_K% of the scientific workflows in $cluster_{smp}$ that are most similar to $sw_{smp}$ are chosen as candidate scientific workflows and recommended in a list.
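The selection logic of these three steps can be sketched as below, assuming the similarities between the sample workflow and every workflow have already been computed. The function name, the dictionary layout, and the rounding rule (round up, at least one result) are assumptions for illustration.

```python
import math


def recommend_sws(sim_to_sample, clusters, rec_k):
    """Pick the cluster whose centre is most similar to the sample workflow,
    then return the top rec_k percent of its members by similarity.

    sim_to_sample: {workflow: similarity to the sample workflow}
    clusters:      {centre workflow: [member workflows, centre included]}
    """
    best_centre = max(clusters, key=lambda centre: sim_to_sample[centre])
    members = sorted(clusters[best_centre],
                     key=lambda sw: sim_to_sample[sw], reverse=True)
    # Assumed rounding rule: round up and always return at least one workflow.
    top_n = max(1, math.ceil(len(members) * rec_k / 100))
    return members[:top_n]


# Toy clusters and precomputed similarities to the sample workflow.
clusters = {"sw2": ["sw1", "sw2", "sw3", "sw4"], "sw7": ["sw6", "sw7"]}
sims = {"sw1": 0.4, "sw2": 0.8, "sw3": 0.9, "sw4": 0.1, "sw6": 0.3, "sw7": 0.5}
print(recommend_sws(sims, clusters, rec_k=50))  # ['sw3', 'sw2']
```

Comparing the sample workflow only against cluster centres keeps the number of similarity computations proportional to the number of clusters plus the size of one cluster, which is the efficiency argument for clustering first.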

ALGORITHM 4: Function GetActivity_SubWF.
Input: (i) requirement.dscs: a list of descriptions on activities and subscientific workflows. (ii) SW: a list of scientific workflows.
Output: D_smp: a set of subscientific workflows; A_smp: a set of activities.
(1) D_smp ⟵ ∅, A_smp ⟵ ∅
(2) for each dsc in requirement.dscs do
(3)   sim_tmp1 ⟵ 0, sim_tmp2 ⟵ 0
(4)   for each sw in SW do
(5)     for each activity a in sw do
(6)       sim ⟵ cosine_sim (doc2vec (dsc), doc2vec (a))
(7)       if sim > sim_tmp1 then
(8)         sim_tmp1 ⟵ sim
(9)         a_tmp ⟵ a
(10)      end if
(11)    end for
(12)    for each sub-scientific workflow d in sw do
(13)      sim ⟵ cosine_sim (doc2vec (dsc), doc2vec (d))
(14)      if sim > sim_tmp2 then
(15)        sim_tmp2 ⟵ sim
(16)        d_tmp ⟵ d
(17)      end if
(18)    end for
(19)  end for
(20)  if sim_tmp1 > sim_tmp2 then
(21)    append a_tmp to A_smp
(22)  else
(23)    append d_tmp to D_smp
(24)  end if
(25) end for
(26) return D_smp, A_smp

An Example on Textual Descriptions.
So far, research studies on recommending whole scientific workflows typically take the scientists' requirements as input for recommendation. For example, Cheng et al. [18] used a layer hierarchy to express the scientist's requirement. In our approach, we mainly adopt textual descriptions to express the scientist's requirement. For ease of illustration, the scientific workflow $sw_1$ in Figure 1 is used as an example on textual descriptions. As shown in Figure 1, there exist a subscientific workflow named CNTCI, which is short for Chemical_Name_To_Chemspider_ID, and a subscientific workflow named Workflow40 in the scientific workflow $sw_1$. We can get the textual description of $sw_1$, i.e., "This workflow will map a chemical name or identifier to uniform resource identifiers (URIs). First the ChemSpider web service is used to map the chemical name to a ChemSpider identifier, then the ChemSpider identifier is mapped to URIs via the Open PHACTS platform."
According to the textual description of $sw_1$, we can use the doc2vec model to learn the sequential relationship between the subscientific workflows CNTCI and Workflow40. In this way, similar structural information involved in scientific workflows can also be obtained and used for the retrieval of appropriate activities and subscientific workflows, part of which is performed by the function cosine_sim in Algorithm 4. Similarly, logical relationships among the components of scientific workflows can also be clearly described in the scientist's requirement. Therefore, though these structural features are not explicitly expressed in the form of HIN, they are implicitly considered and used in our proposed approach for generating more accurate recommendations.

Experiments
In this section, a series of experiments are performed to answer two questions: (1) Compared with the state-of-the-art scientific workflow recommendation techniques, does our approach have better performance? (2) What is the performance of our HDSWR approach in the presence of different parameters and datasets used for recommendation?
All experiments are performed on a computer with an Intel(R) Core(TM) i5-7300HQ CPU @ 2.50 GHz and 8 GB memory running Windows 10, JDK 1.8.0, and Python 3.5. Next, we focus on experimental evaluations of these two questions.

Datasets.
myExperiment is a widely used scientific workflow repository supporting the publication and sharing of scientific workflows. It also allows scientists to search for scientific workflows related to their research and then reuse and repurpose them according to their distinct needs [31].
There are various types of scientific workflows in myExperiment, such as Taverna 1 and Taverna 2. We crawled data on the Taverna 2 type of scientific workflows from myExperiment and created two datasets named SW#80 and SW#236 accordingly. The datasets used in our experiments are publicly accessible from GitHub via the website: https://github.com/yixinxunwu/myExperiment.
As Table 2 shows, the SW#80 dataset includes 80 scientific workflows with 229 activities, 125 tags, and 85 subscientific workflows, where the number of activities contained in each scientific workflow is in the range of 3 to 20.
The SW#236 dataset includes 236 scientific workflows with 430 activities, 310 tags, and 243 subscientific workflows, where the number of activities contained in each scientific workflow is in the range of 2 to 30.

Evaluation Metrics.
ALGORITHM 5: Function RecommendSWs (sw_smp, Clusters, rec_K, α, β, γ, δ).
Input: (i) sw_smp: a sample scientific workflow. (ii) Clusters: the set of scientific workflow clusters. (iii) rec_K: a hyper-parameter to control the number of recommended scientific workflows. (iv) α, β, γ, δ: weight coefficients.
Output: (i) SW_rec: a list of recommended scientific workflows.
(1) v^A_smp, v^T_smp, v^D_smp ⟵ construct the feature vectors on the activities, tags, and sub-scientific workflows of sw_smp
(2) sim_smp ⟵ 0 and cluster_smp ⟵ ∅
(3) for each cluster ∈ Clusters do
(4)   sw_ct ⟵ choose the cluster center scientific workflow of the cluster
(5)   obtain the feature vectors of sw_ct
(6)   calculate the similarity strengths of sw_smp and sw_ct by equations (1)-(3)
(7)   calculate sim_p1(smp, ct), sim_p2(smp, ct), sim_p3(smp, ct), sim_p4(smp, ct)
(8)   calculate sim(smp, ct)
(9)   sim_tmp ⟵ sim(smp, ct)
(10)  if sim_smp < sim_tmp then
(11)    sim_smp ⟵ sim_tmp
(12)    cluster_smp ⟵ cluster
(13)  end if
(14) end for
(15) SW_rec ⟵ choose the top rec_K% most similar scientific workflows in cluster_smp
(16) return SW_rec

To evaluate the effectiveness of scientific workflow recommendations, we adopt the precision and recall measures used in [18] and the $F_1$ score used in [16] as our evaluation metrics, which are described as equations (8)-(10), respectively:

$$precision = \frac{|SW_{rec} \cap SW_{ept}|}{|SW_{rec}|} \tag{8}$$

$$recall = \frac{|SW_{rec} \cap SW_{ept}|}{|SW_{ept}|} \tag{9}$$

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{10}$$

In equations (8)-(10), the notation $SW_{rec}$ represents the list of scientific workflows generated by a recommendation algorithm, and the notation $SW_{ept}$ represents the expected list of scientific workflows. Similar to the work in [18], $SW_{ept}$ is generated by selecting the top exc_K% most similar scientific workflows in a dataset. Besides, the symbols $|SW_{rec}|$ and $|SW_{ept}|$ denote the numbers of scientific workflows in $SW_{rec}$ and $SW_{ept}$, respectively.
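The three metrics can be computed from the recommended and expected lists as in the sketch below, assuming the usual set-based definitions in which a hit is a workflow that appears in both lists. The function name and the workflow identifiers are illustrative.

```python
def precision_recall_f1(recommended, expected):
    """Set-based precision, recall, and F1 for one recommendation list."""
    rec, ept = set(recommended), set(expected)
    hits = len(rec & ept)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(ept) if ept else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy lists: two of the four recommendations are among the three expected.
sw_rec = ["sw1", "sw2", "sw3", "sw4"]
sw_ept = ["sw2", "sw4", "sw5"]
precision, recall, f1 = precision_recall_f1(sw_rec, sw_ept)
print(precision)  # 0.5
```

Increasing rec_K lengthens the recommended list, which tends to raise recall and lower precision; the F1 score balances the two.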

Methods Used for Experiments.
The scientific workflow recommendation methods used in the experiments are as follows:
(i) LH [18]: this method first converts a scientific workflow into a hierarchy that captures the relationships among scientific workflows, subscientific workflows, and activities. Thus, the similarity assessment between scientific workflows becomes a similarity assessment between hierarchies.
(ii) LHWT [27]: this method also first transforms a scientific workflow into a hierarchy, as described in [18]. In addition, because tags label the functional semantics of a scientific workflow, this method uses tag information in the similarity computation for recommendation.
(iii) HDSWR: our proposed recommendation approach. In our experiments, the parameters of HDSWR are set as follows: α = 0.55, β = γ = 0.2, and δ = 0.05.
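To make the role of the four weight coefficients concrete, the following Python sketch mirrors the control flow of Algorithm 5 (RecommendSWs). It rests on two assumptions that this section does not confirm: the overall similarity is taken to be a weighted sum of the four metapath similarities, and rec_K is interpreted as a fraction of the selected cluster. The names `combined_similarity`, `recommend`, `toy_path_sims`, and the toy similarity values are all hypothetical.

```python
# Weights reported for HDSWR in the experiments (alpha, beta, gamma, delta).
WEIGHTS = (0.55, 0.2, 0.2, 0.05)


def combined_similarity(path_sims, weights=WEIGHTS):
    """Assumed form: weighted sum of the four metapath similarities."""
    return sum(w * s for w, s in zip(weights, path_sims))


def recommend(sample_id, clusters, path_sims_fn, rec_k):
    """Pick the cluster whose centre is most similar to the sample,
    then return the top rec_k fraction of that cluster's members."""
    best_cluster, best_sim = None, 0.0
    for cluster in clusters:
        sim = combined_similarity(path_sims_fn(sample_id, cluster["center"]))
        if best_sim < sim:  # same comparison as steps (10)-(12)
            best_cluster, best_sim = cluster, sim
    if best_cluster is None:
        return []
    ranked = sorted(
        best_cluster["members"],
        key=lambda sw: combined_similarity(path_sims_fn(sample_id, sw)),
        reverse=True,
    )
    return ranked[: max(1, int(rec_k * len(ranked)))]


# Hypothetical metapath similarities between the sample and each workflow.
toy_sims = {
    "a": (0.9, 0.8, 0.7, 0.6),
    "b": (0.5, 0.9, 0.4, 0.2),
    "c": (0.1, 0.2, 0.1, 0.0),
    "d": (0.2, 0.1, 0.3, 0.1),
}


def toy_path_sims(sample_id, other_id):
    return toy_sims[other_id]


clusters = [
    {"center": "a", "members": ["a", "b"]},
    {"center": "c", "members": ["c", "d"]},
]
recommended = recommend("sample", clusters, toy_path_sims, rec_k=0.5)
print(recommended)
```

Comparing the sample only against cluster centres keeps the number of similarity computations proportional to the number of clusters plus the size of one cluster, rather than to the size of the whole repository.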

Comparison with Related Scientific Workflow Recommendation.
As described in Section 6.2, the evaluation metrics are based on SW_rec and SW_ept, which in turn depend on the parameters rec_K% and exc_K%. Therefore, we study the impact of rec_K% and exc_K% on the different recommendation methods with the SW#80 dataset.
To investigate the impact of rec_K% on recommendation precision and recall, exc_K% is fixed at 10% and rec_K% is varied from 4% to 30% in steps of 2%. As shown in Figures 5(a) and 5(b), HDSWR and LHWT achieve higher precision and recall than LH. This is because some functions implemented in scientific workflows are not mentioned in the workflow descriptions but only in the tags [27]. As a result, without tag information it is difficult to group these scientific workflows into the appropriate clusters. When tag information is considered, these scientific workflows are reassigned to the appropriate clusters. This demonstrates that the functional semantics of tags have a great impact on scientific workflow recommendation. We also observe that HDSWR is superior to LHWT in precision and recall, because HDSWR applies metapaths to capture the weak semantics between scientific workflows and thus recommends at a higher semantic level than LHWT.
When rec_K% is set to a relatively small value (e.g., 4% or 6%), the precision and recall of the three methods are extremely close. This indicates that the scientific workflows most similar to the sample scientific workflow are recommended to scientists regardless of the recommendation method. When rec_K% is set to a relatively large value, the precision of all methods drops considerably, as shown in Figure 5(a), because many unrelated scientific workflows that do not exist in SW_ept are recommended. Meanwhile, the recall of all methods is relatively stable in Figure 5(b), because SW_ept is determined by exc_K%, which is fixed. Furthermore, the recall of HDSWR stabilizes once rec_K% reaches 14%. This shows that most of the expected scientific workflows in SW_ept have been identified and recommended to scientists by HDSWR at that point. The recall of LHWT stabilizes when rec_K% is 18%, and the recall of LH does not stabilize until rec_K% is 22%.
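The effect of sweeping rec_K% while exc_K% stays fixed can be reproduced on synthetic data. The sketch below is purely illustrative: the reference ranking, the perturbed method ranking, and the helper names `top_percent` and `precision_recall` are hypothetical and do not reproduce the paper's data or results. It only shows why precision falls and recall levels off once the recommended list grows past the fixed expected list.

```python
def precision_recall(recommended, expected):
    """Set-overlap precision and recall for one recommendation list."""
    hits = len(set(recommended) & set(expected))
    return hits / len(recommended), hits / len(expected)


def top_percent(ranking, pct):
    """Return the first pct percent of a ranked list (rounded)."""
    return ranking[: round(pct * len(ranking) / 100)]


N = 80  # same size as the SW#80 dataset; the contents are synthetic
reference_ranking = list(range(N))  # hypothetical ground-truth order
# Hypothetical method ranking: a deterministic perturbation of the reference.
method_ranking = sorted(range(N), key=lambda i: (i + 3 * (i % 3), i))

expected = top_percent(reference_ranking, 10)  # exc_K% fixed at 10%
sweep = {}
for rec_k in range(4, 31, 2):  # rec_K% = 4%, 6%, ..., 30%
    recommended = top_percent(method_ranking, rec_k)
    sweep[rec_k] = precision_recall(recommended, expected)

for rec_k, (p, r) in sweep.items():
    print(f"rec_K={rec_k:2d}%  precision={p:.2f}  recall={r:.2f}")
```

In this synthetic run, recall reaches 1.0 once all eight expected workflows have been recommended, and every additional recommendation after that point can only lower precision.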
To study the impact of exc_K% on recommendation precision and recall, rec_K% is fixed at 10% and exc_K% is set to 4%, 6%, . . ., 30%. As shown in Figures 5(c) and 5(d), the precision and recall of HDSWR are higher than those of LH and LHWT, for the reasons given above. When exc_K% is set to a relatively large value, SW_ept contains many scientific workflows, while the number of scientific workflows in SW_rec is fixed. Therefore, the precision of all methods is stable. However, because the discrepancy between SW_ept and SW_rec keeps increasing, the recall of all methods declines.
To show the differences in recommendation performance more intuitively, we use the F1 score. Figures 6(a) and 6(b) show how the F1 score varies with rec_K% and exc_K%. The differences between HDSWR, LHWT, and LH are small in the first two groups (i.e., when the value of rec_K% is 4% and 6%), which indicates that the scientific workflows most similar to the sample scientific workflow are easy to recommend. As rec_K% or exc_K% increases, the differences between the methods become distinct, and the advantage of HDSWR over the other methods becomes obvious. This demonstrates that HDSWR captures the similarity semantics between scientific workflows effectively and thus promotes a reasonable clustering of scientific workflows. When rec_K% exceeds 24% and exc_K% exceeds 22%, the differences between the methods become stable, which indicates that the choice of recommendation method matters little once excessive scientific workflows are recommended.

Detailed Analysis of the Proposed Approach.
In this part, we conduct a series of experiments to analyse the details of our proposed method.

Impact of Clustering Method.
As described in Section 5.3, HDSWR uses the DPC clustering algorithm to group scientific workflows into appropriate clusters and thereby assist scientific workflow recommendation. Therefore, the impact of the clustering algorithm on scientific workflow recommendation is worth studying. In our previous work [27], the SNN (Shared Nearest Neighbour) clustering algorithm [30] was used to cluster scientific workflows, whereas this study uses DPC. Figure 7 compares the performance of the two clustering algorithms on the SW#80 dataset. The overall recommendation performance ranking is DPC > SNN. SNN performs poorly because it treats some data points below the density threshold, and the points within their neighbourhood, as noise.
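For readers unfamiliar with density peak clustering, the following Python sketch illustrates the general idea: each point gets a local density and a distance to its nearest denser point, and points where both are large become cluster centres. It is a generic, simplified version with a cutoff kernel and a fixed number of centres, not the exact variant or parameter choice used by HDSWR. The function name, the cutoff `d_c`, and the toy one-dimensional data are assumptions; for scientific workflows, the distance could for instance be taken as 1 minus the similarity.

```python
def density_peak_clustering(dist, d_c, n_centers):
    """Simplified density peak clustering on a symmetric distance matrix.

    dist: list of lists of pairwise distances; d_c: cutoff distance;
    n_centers: number of cluster centres to select.
    Returns (centers, labels).
    """
    n = len(dist)
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Densest points first; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n            # distance to the nearest denser point
    nearest_denser = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], nearest_denser[i] = dist[i][j], j

    # Centres combine high density with large separation (rho * delta).
    # Sorting `order` stably keeps the densest point first among ties.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_centers]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return centers, labels


# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]
centers, labels = density_peak_clustering(dist, d_c=1.5, n_centers=2)
print(centers, labels)
```

Because every non-centre point is assigned through its nearest denser neighbour, no point is discarded as noise in this simplified version, which is consistent with the contrast drawn with SNN above.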

Impact of the Size of Datasets.
To study the impact of dataset size on the recommendation performance of the recommendation methods, we conduct a series of experiments with the three methods on the SW#236 dataset, which contains a relatively larger amount of data. The experimental setting is the same as that for the SW#80 dataset.
As shown in Figures 8(a) and 8(b), the HDSWR approach has better recommendation performance than the other methods, both on the SW#80 dataset with a small amount of data and on the SW#236 dataset with a relatively large amount of data. This indicates that the HDSWR approach is robust to dataset size and that considering the attribute information of scientific workflows effectively improves recommendation performance. Besides, we find that the performance gap between the LHWT and HDSWR approaches on the SW#236 dataset is smaller than that on the SW#80 dataset.

Comparison of the Time Efficiency.
To evaluate the time efficiency of the HDSWR approach, we conduct a series of experiments with the SW#236 and SW#80 datasets. Table 3 shows the average running time (in seconds) of the three methods on the two datasets.
As shown in Table 3, the HDSWR approach has a shorter running time than the other methods. In fact, similarity computation accounts for most of the running time of all three methods, while clustering needs little time. The LHWT method builds on the LH method and simply appends extra tag information for similarity computation. Therefore, the LHWT method needs more running time than the LH method. In contrast, the similarity computation adopted by the HDSWR approach is based on the HIN, which is totally different from that of the other methods. Therefore, it effectively reduces the running time needed to handle the various kinds of information used in similarity computation.

Conclusion
In this paper, we aim to provide automatic support for the reuse and modelling of scientific workflows. Specifically, we use a heterogeneous information network to organize and represent the relations between scientific workflows and consider tag, description, activity, and subscientific workflow objects for scientific workflow recommendation. We propose a novel scientific workflow similarity computation method based on metapath. In addition, we present a scientific workflow recommendation approach named HDSWR, in which the density peak clustering algorithm groups scientific workflows into clusters and a list of scientific workflows is ranked and recommended according to the requirements of scientists and engineering personnel. As future work, we plan to consider how to apply machine learning methods [32][33][34][35] to automatically tune some parameters of HDSWR and yield better performance. Furthermore, we will handle related privacy problems in view of the newest research studies [36][37][38][39][40][41].
Data Availability
The data sets of our experiments are publicly accessible via the following website: https://github.com/yixinxunwu/myExperiment.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.