Clustering sequence graphs

In application domains ranging from social networks to e-commerce, it is important to cluster users with respect to both their relationships (e.g., friendship or trust) and their actions (e.g., visited locations or rated products). Motivated by these applications, we introduce the task of clustering the nodes of a sequence graph, i.e., a graph whose nodes are labeled with strings (e.g., sequences of users' visited locations or rated products). Both string clustering algorithms and graph clustering algorithms are inappropriate for this task, as they do not consider the structure of the strings and of the graph simultaneously. Moreover, attributed graph clustering algorithms generally construct poor solutions because they need to represent each string as a vector of attributes, which inevitably loses information and may harm clustering quality. We thus formalize the problem of clustering a sequence graph. We first propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance, and another based on SimRank. We then formalize the problem under each measure, showing that it is NP-hard in both cases. In addition, we design a polynomial-time 2-approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efficiency of our methods.


Introduction
Graph clustering [1] is a fundamental data mining task, which seeks to partition the nodes of an input graph so that similar nodes form a group, referred to as a cluster. The task is important in application domains such as social networks, where nodes represent users, edges represent friendship relationships between users, and clustering aims to detect user communities [2]; or e-commerce, where nodes represent consumers, edges represent trust relationships between consumers, and clustering aims to identify groups of consumers with bonds of trust among them [3]. The task is also important in medicine, as clustering patient similarity networks (i.e., graphs whose nodes represent patients and whose edges connect similar patients according to demographics or genetic mutations) allows identifying clinically homogeneous patient groups [4].
However, in these application domains, the similarity among users often depends not only on user relationships but also on actions or events associated with users. Take for example a geo-social network, such as Foursquare, where users are associated with their history of visited locations. Two connected users may naturally be regarded as more similar in the network if they have a similar history of visited locations [5,6]. For example, similarity with respect to both the sequence of visited locations and the social network structure is considered in recommendation [5] and location prediction [6]. Likewise, in e-commerce, two connected users may be regarded as more similar in a trust network if they have a similar history of rated products [7]. Also, in medicine, two connected patients may be regarded as more similar (e.g., with respect to disease progression) if they have a similar history of diagnoses [8].

4. We present a case study on phylogenetic trees [24]. A phylogenetic tree is often constructed from a set of strings, each representing the genomic sequence of an organism. It is a hierarchical representation of all clusterings of the genomic sequences of the organisms and is often modeled as a binary tree whose leaf nodes are labeled with the strings. Such a tree can thus be seen as a special type of sequence graph. Given a phylogenetic tree and a positive integer k, the computational task we consider is to cluster the leaf nodes of the tree into k clusters, by taking into account both the tree topology and the sequences corresponding to the leaf nodes. We can then evaluate how accurate this clustering is by comparing it to a ground-truth clustering. If these two clusterings are similar, then the phylogenetic tree is meaningful. Our results indeed show that the measures we introduce (and the corresponding clustering algorithms) are a reliable way to evaluate whether a given phylogenetic tree is in accordance with a given ground-truth clustering consisting of k clusters. On the other hand, we show that four state-of-the-art attributed graph clustering algorithms [12,16-18] are not suitable for this task. The results of the case study also indicate that our methods could be useful in other bioinformatics applications, such as evaluating phylogenetic networks [25].
Paper Organization. The remainder of the paper is organized as follows. In Section 2, we review related work. In Section 3, we define some preliminary concepts. In Section 4, we present our distance measures for sequence graph clustering. In Section 5, we define the two sequence graph clustering problems that we aim to solve and study their hardness. In Section 6, we present our algorithms for addressing these problems. In Section 7, we present an experimental evaluation of our algorithms. In Section 8, we present a case study to showcase the applicability of our algorithms. In Section 9, we conclude the paper.

Related work
Our work is related to sequence clustering and graph clustering, two data mining tasks with several applications [8,26-28]. Therefore, in the following, we briefly review algorithms for clustering a collection of sequences (strings), as well as algorithms for clustering a (non-attributed) graph (see [29,30] for surveys). Our work is also related to attributed graph clustering. In an attributed graph, each node is labeled with a vector of attribute values. We therefore review some recent works on attributed graph clustering and refer the reader to [31] for a survey. Last, we discuss the use of string-labeled graphs in bioinformatics.
Sequence Clustering and Graph Clustering. Algorithms for clustering a collection of sequences (strings) measure distance between sequences directly [32], or first project the sequences onto a set of patterns (e.g., q-grams) and then measure distances in the projected space [10,33]. Alternatively, they employ generative models for the input collection of sequences, which are used to obtain likelihood-based distances between sequences [34]. In any case, the distance measures are given as input to a clustering algorithm for vector data [30] to obtain clusters.
Algorithms for clustering a graph employ graph partitioning (e.g., they solve a minimum cut problem [35]), spectral clustering [36,37], or cohesive subgraph detection techniques [38]. Alternatively, the algorithms in [39,40] learn a node embedding into a vector space, which is then fed into a clustering algorithm for vector data.
Algorithms for clustering a collection of sequences [10,33,34] or for clustering a graph [1,38,39] are not appropriate for clustering sequence graphs, as we show in our experiments (Section 7). This is because they utilize either only the strings (i.e., they cluster the collection of labels in a sequence graph while ignoring the graph structure) or only the graph structure (i.e., they cluster a sequence graph while ignoring its labels), although both the strings and the graph structure determine clustering quality. Also, we cannot convert a sequence graph to a graph with string distances as edge weights and then cluster the weighted graph. This is because our clustering problem needs distances between the strings of nodes that are not connected, and it is generally not possible to compute these distances exactly by combining edge weights.
An example of a graph-convolution-based algorithm is Adaptive Graph Convolution (AGC) [18]. The AGC algorithm uses graph convolution to obtain smooth feature representations of node attributes and then applies spectral clustering. The underlying assumption of AGC is that nodes that are close in the graph will be clustered together [18]. An example of a matrix-factorization-based algorithm is Text-Associated DeepWalk (TADW) [12]. This algorithm is based on DeepWalk [39] and uses textual attributes to supervise random walks on graphs. Examples of embedding-based algorithms are Text Enhanced Network Embedding (TENE) [16] and Binarized Attributed Network Embedding (BANE) [17]. Both of these algorithms aim at learning a low-dimensional vector representation for each node and its associated attributes in the attributed graph. BANE uses Weisfeiler-Lehman graph kernels [41] to encode dependencies between node edges and attributes into a binary code representation. This representation encodes first-order proximities [42] between nodes. TENE aims to jointly learn vector representations based on both first-order and second-order proximities [42], as well as on the text cluster membership matrix.
Although one can represent string labels as attribute vectors (e.g., by representing a string as a vector of q-grams and their frequencies [10] or their tf-idf scores [16]) and then apply an attributed graph clustering algorithm to our problem, such a representation inevitably loses information and thus severely degrades the quality of clustering, as we show in our experiments (Section 7). The reason it loses information is that one needs to assume a single order for the q-grams of all sequences in order to construct the vector representations of all sequences. However, the q-grams do not appear in the same order in all sequences. To illustrate this point, we provide the following example.

Example 1.
Let T1 = aba and T2 = bab be two sequences. The 2-grams of both of these sequences are ab and ba, and each of these 2-grams appears exactly once in T1 and in T2. Thus, T1 and T2 have the same vector representation (1, 1), where the first 1 denotes the frequency of ab and the second 1 denotes the frequency of ba, assuming a lexicographic order of q-grams. Since T1 and T2 have the same vector representation, they are treated as equal by attributed graph clustering algorithms, although they are not equal. Similarly, consider a dataset {T1, T2, T3} with T3 = ccc. The vector representation of both T1 and T2 when using tf-idf scores instead of frequencies is (log2(3/2), log2(3/2)), so again T1 and T2 are treated as equal in similarity computation. Thus, the similarity information of sequences is not captured well after they are represented based on q-grams.
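To make Example 1 concrete, the following minimal Python sketch (the function name `qgram_vector` is illustrative, not from the paper's code) builds the frequency vectors and shows that the two distinct strings collide:

```python
from collections import Counter

def qgram_vector(s, q, qgrams):
    # Count each q-gram of s, reported in a fixed (lexicographic) order.
    counts = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return tuple(counts.get(g, 0) for g in qgrams)

# T1 = aba and T2 = bab from Example 1: distinct strings, identical vectors.
order = sorted({"ab", "ba"})
v1 = qgram_vector("aba", 2, order)
v2 = qgram_vector("bab", 2, order)
assert v1 == v2 == (1, 1) and "aba" != "bab"
```

Any downstream algorithm that only sees `v1` and `v2` necessarily treats the two strings as identical.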
Sequence-labeled Graphs in Bioinformatics. In bioinformatics, graphs with sequence-labeled nodes have been used extensively in the following context: the nodes represent (short) DNA fragments read by sequencing technologies, and a weighted (directed) edge (u, v) represents the length of the suffix-prefix overlap between the sequence of u and the sequence of v. The goal is then to assemble these fragments into a candidate genome represented by some trail in the graph [9]. Let us stress that this task is not related to clustering.

Background
An alphabet Σ is a finite non-empty set of elements called letters; we denote its size by |Σ|.
For two positions i and j on a string T, we denote by T[i..j] = T[i]…T[j] the substring of T that starts at position i and ends at position j of T. By ε we denote the empty string of length 0. We refer to a length-q substring of a string T as a q-gram (e.g., in Fig. 1, aab is a 3-gram of string aaab).
Let G = (V, E) be a simple graph,¹ where V is a set of nodes and E ⊆ V × V is a set of edges. The set of neighbors of a node u ∈ V is denoted by N(u). A path is a sequence of distinct nodes in which every two consecutive nodes are connected by an edge; since the nodes are distinct, a path has no cycles. A tour is a path that may have cycles.
A sequence graph  = ( , , ,  ) is a tuple, where  is a set of nodes,  ⊆  ×  is a set of edges,  is a set of strings drawn from an alphabet   , and  ∶  →  ∪ {} is a function that outputs a possibly empty string.In particular, each node  ∈  is associated with a string  () ∈  ∪ {}.For example, in Fig. 1,  = {, , , , },   = {, , }, and  ( 1 ) = .A -clustering of  for  ∈ [1, | |] is a partition of  into  subsets, called clusters.We may omit  from -clustering when it is clear from the context.
The edit distance d_E(S, T) between two strings S and T is defined as the minimum number of elementary edit operations (letter insertion, deletion, or substitution) needed to transform S into T. For S and T of equal length, the Hamming distance d_H(S, T) is defined as the minimum number of substitutions needed to transform S into T. For example, d_E(abb, aab) = d_H(abb, aab) = 1. If S and T are not of the same length, we set d_H(S, T) = ∞, for completeness.
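The two distances above can be sketched as follows; `edit_distance` is the textbook dynamic program, and `hamming_distance` follows the convention of returning ∞ for strings of unequal length (both function names are illustrative, not the paper's code):

```python
def edit_distance(s, t):
    # Classic dynamic program: row i holds distances of s[:i] to prefixes of t.
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1])) # substitution
        prev = cur
    return prev[n]

def hamming_distance(s, t):
    # Defined only for equal-length strings; infinite otherwise, as above.
    if len(s) != len(t):
        return float("inf")
    return sum(a != b for a, b in zip(s, t))
```

For example, `edit_distance("abb", "aab")` and `hamming_distance("abb", "aab")` both evaluate to 1.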
The shortest path distance d_SP(u, v) between two nodes u and v of a graph is defined as the length of a shortest path between u and v. For completeness, we set d_SP(u, v) = ∞ if there is no path between u and v. For example, in Fig. 1, d_SP(u1, u3) = 1.
Given a constant C ∈ (0, 1), referred to as the decay factor, the SimRank score s(u, v) between u and v is defined as follows [20]: s(u, v) = 1 if u = v, and otherwise

s(u, v) = (C / (|N(u)| · |N(v)|)) · ∑_{a ∈ N(u)} ∑_{b ∈ N(v)} s(a, b),   (1)

with s(u, v) = 0 if N(u) or N(v) is empty. The intuition behind SimRank is that two nodes are similar if they are reachable by similar nodes. SimRank aggregates similarities based on paths.
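As an illustration of the fixed-point computation behind SimRank, the following sketch iterates the recurrence on an undirected graph given as an adjacency dictionary (a simplification: the original SimRank paper works with in-neighbors on directed graphs):

```python
def simrank(neighbors, C=0.8, iters=10):
    # neighbors: {node: set of adjacent nodes}; C: decay factor in (0, 1).
    nodes = list(neighbors)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0           # a node is maximally similar to itself
                elif neighbors[u] and neighbors[v]:
                    total = sum(s[(a, b)] for a in neighbors[u] for b in neighbors[v])
                    nxt[(u, v)] = C * total / (len(neighbors[u]) * len(neighbors[v]))
                else:
                    nxt[(u, v)] = 0.0           # no neighbors means no evidence
        s = nxt
    return s
```

On the path a-b-c, for instance, s(a, c) equals C after the first iteration, since a and c share the single neighbor b.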
A metric space (X, d) is an ordered pair, where X is a set and d is a metric (also referred to as a distance function) on X. For example, the set of strings S of a sequence graph G, together with the edit distance d_E, which is a metric [43], defines the metric space (S, d_E).

Distance measures for sequence graphs
In the following, we discuss our distance measures for sequence graphs.

The 𝑑 𝐸𝑆𝑃 measure
Given a sequence graph  = ( , , ,  ), metric spaces (,   ) and ( ,   ), nodes ,  ∈  and strings  (),  (), the   (E is for Edit distance and SP for Shortest Path distance) measure is defined as: The   measure considers the distance between two nodes based on the edit distance between the strings of the nodes and the shortest path distance between the nodes.For example, in Fig. 1,   ( 1 ,  3 ) = √ 2 because   (, ) = 1 and   ( 1 ,  3 ) = 1.That is,   combines the two metric spaces (,   ) and ( ,   ) into a metric space which measures similarity among string-labeled nodes, as if they were points in a 2D space.Note that   is a metric.This is because both edit distance and shortest path distance are metrics [43,44] and   is a 2-product metric [19] on the Cartesian product of the set of strings  and the set of nodes  in a sequence graph  = ( , , ,  ).

The 𝑑 𝐸𝑆𝑅 measure
Our d_ESR (E is for Edit distance and SR is for SimRank) measure captures the intuition of SimRank, according to which u and v are similar when they are reachable by similar nodes. However, d_ESR differs from SimRank in that it also considers the similarity of the strings of u and v and the strings of their reachable nodes. For example, among two node pairs with the same SimRank score (e.g., (u1, u5) and (u1, u6) in Fig. 1), d_ESR treats the node pair having more similar strings with respect to edit distance as more similar (e.g., in Fig. 1, d_ESR(u1, u5) < d_ESR(u1, u6)).

The ESR(u, v) score is defined analogously to the SimRank score in Eq. (1), with the decay factor C replaced by the string similarity score

s(u, v) = (1 − ε) · (1 − d_E(σ(u), σ(v)) / max(|σ(u)|, |σ(v)|)),

and the distance measure is d_ESR(u, v) = 1 − ESR(u, v). The score s(u, v) is computed by subtracting the normalized edit distance from 1 and multiplying by (1 − ε), for a small real number ε > 0. This ensures that s(u, v) < 1 for any pair of nodes (u, v), which is required for the iterative computation of ESR (see Theorem 1). Importantly, it also ensures that the multiplication does not substantially change similarity (i.e., the values of s(u, v) and of 1 − d_E(σ(u), σ(v)) / max(|σ(u)|, |σ(v)|) are nearly equal). Note that ESR cannot be derived from SimRank by setting C = s(u, v) in Eq. (1), since in SimRank C > 0 [20] while s(u, v) may be 0. Furthermore, ESR is not a simple weighted version of SimRank with C = 1 and weight s(u, v), since C < 1 in Eq. (1). ESR is also different from SemSim [45], which was developed for attributed graphs and assumes a large value of normalized semantic similarity between any pair of values.² Also, it is easy to see that d_ESR differs from d_ESP in that it considers all nodes reachable from u and v as well as their strings, while d_ESP considers only the strings of u and v.
In the following, we show that a fixed-point iteration method, similar to the one developed for SimRank [20], can compute ESR. Specifically, in Theorem 1, we define a function ESR_k(u, v), for a node pair u, v and an integer k ≥ 0, which quantifies the similarity of u and v. Next, we prove that ESR_k(u, v) can be computed iteratively and that the values of ESR_k(u, v) converge to ESR(u, v). Clearly, our measure can also benefit from efficiency optimizations of SimRank (e.g., [46]) and be extended to address the "zero-similarity" problem [47].
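The following sketch shows one plausible form of this fixed-point iteration, under the assumption (consistent with the discussion above, but not a verbatim transcription of the paper's Eq. (3)) that the recurrence mirrors SimRank's with the decay factor replaced by the string-similarity score s(u, v):

```python
def string_sim(s, t, edit_distance, eps=1e-9):
    # s(u, v): (1 - eps) * (1 - normalized edit distance); two empty strings
    # are treated as maximally similar.
    m = max(len(s), len(t))
    if m == 0:
        return 1.0 - eps
    return (1.0 - eps) * (1.0 - edit_distance(s, t) / m)

def esr(neighbors, label, edit_distance, iters=5, eps=1e-9):
    # Assumed recurrence: ESR_0 is the identity; each step aggregates
    # neighbor scores, weighted by s(u, v) instead of a fixed decay factor.
    nodes = list(neighbors)
    sim = {(u, v): string_sim(label[u], label[v], edit_distance, eps)
           for u in nodes for v in nodes}
    cur = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif neighbors[u] and neighbors[v]:
                    tot = sum(cur[(a, b)] for a in neighbors[u] for b in neighbors[v])
                    nxt[(u, v)] = sim[(u, v)] * tot / (len(neighbors[u]) * len(neighbors[v]))
                else:
                    nxt[(u, v)] = 0.0
        cur = nxt
    return cur  # d_ESR(u, v) can then be taken as 1 - ESR(u, v)
```

Note how s(u, v) = 0 forces the score of (u, v) to 0 at every iteration, which is the property the convergence discussion in Section 7 relies on.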

Proxy measures for 𝑑 𝐸𝑆𝑃 and 𝑑 𝐸𝑆𝑅
Since exact edit distance computation requires quadratic time (assuming the Strong Exponential Time Hypothesis (SETH) is true [49]), d_ESP and d_ESR are expensive to compute. Therefore, we propose the proxies d_ÊSP and d_ÊSR for d_ESP and d_ESR, respectively. Instead of considering the strings of a pair of nodes, the proxies consider embeddings of these strings into a Hamming distance space. An embedding from a metric space M1 to a metric space M2 is a map of points from M1 to M2 such that distances are preserved up to some factor c, known as the distortion. We employ the CGK algorithm [50], which provides a probabilistic embedding with linear distortion. The CGK algorithm runs in linear time, and if the edit distance between two input strings is K, then the Hamming distance between their embeddings is between K/2 and O(K²) with good probability [50]. By incorporating CGK instead of edit distance into our algorithms, we substantially improve their efficiency without substantially degrading effectiveness, as will be shown in Section 7.
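A sketch of a CGK-style embedding is given below; the output length of 3|s| and the shared random coin table follow the construction in [50], while the padding symbol and the function signature are our own illustrative choices:

```python
import random

def cgk_embed(s, alphabet, out_len=None, seed=0):
    # Randomized CGK-style embedding: a walk over s that advances with
    # probability 1/2 per output position, so Hamming distance between
    # embeddings approximates edit distance up to linear distortion.
    rng = random.Random(seed)          # the same seed must be shared by all strings
    out_len = out_len or 3 * len(s)
    coin = [{c: rng.randint(0, 1) for c in alphabet} for _ in range(out_len)]
    out, i = [], 0
    for j in range(out_len):
        if i < len(s):
            out.append(s[i])
            i += coin[j][s[i]]         # advance the walk with probability 1/2
        else:
            out.append("#")            # padding symbol outside the alphabet
    return "".join(out)
```

Crucially, the random coins are a function of the position and the letter only, so two strings embedded with the same seed are directly comparable position by position.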

Sequence graph 𝒌-Center and 𝒌-Median
We present two clustering problems for sequence graphs, inspired by the k-Center and k-Median problems [21-23].

Sequence graph 𝑘-Center (SGC)
The SGC problem, defined in Problem 1 below, requires finding k nodes, referred to as centers, such that the maximum distance of any node to its closest center with respect to d_ESP is minimized.
A clustering {C1, …, Ck} of V is obtained from a solution U to SGC by assigning to each cluster Ci, i ∈ [1, k], a center ui ∈ U, as well as every node v ∉ U that is closer to this center than to any other center (i.e., every v such that d_ESP(v, ui) < d_ESP(v, uj) for each j ≠ i). If a node is at equal distance from multiple centers, it is assigned to an arbitrarily selected cluster containing one of these centers.
The decision version of SGC asks whether there exists a subset U ⊆ V of k nodes such that max_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r, for a given real number r.
We show that SGC is NP-hard by showing that its decision version is NP-complete. We prove this result via a reduction from the decision version of metric k-Center [22], which is known to be NP-complete [22].
Problem 2 (Metric k-Center (Decision Version) [22]). Given a metric space (X, d), where X is a set of n points and d : X × X → R⁺ is a distance function, an integer k ∈ [1, n], and a real number r, decide whether there exists a subset C ⊆ X of k points such that max_{p∈X} min_{c∈C} d(p, c) ≤ r.

Lemma 1. The decision version of SGC is NP-complete.
Proof. The decision version of SGC is clearly in NP. In the following, we reduce the decision version of metric k-Center to it. Given any instance I_C of the decision version of metric k-Center, we construct an instance I_S of the decision version of SGC in polynomial time, by creating a sequence graph G = (V, E, S, σ) as follows: (I) We add a node u into V for each point p ∈ X, where X is the set of points in I_C. (II) We set E = ∅. (III) We set S = {ε}. (IV) We define σ such that σ(u) = ε for each u ∈ V. We also set k and r in I_S to k and r in I_C, respectively, and d_ESP(u, v) = d(p, p′) for every pair of nodes (u, v) corresponding to a pair of points (p, p′). This completes the construction of I_S.

In the following, we prove that I_S has a positive answer if and only if I_C has a positive answer. (⇒) If I_S has a positive answer, there is a subset U ⊆ V of k nodes such that max_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r. These nodes correspond to a subset C ⊆ X of k points such that max_{p∈X} min_{c∈C} d(p, c) ≤ r. Thus, I_C has a positive answer. (⇐) If I_C has a positive answer, then there is a subset C ⊆ X of k points such that max_{p∈X} min_{c∈C} d(p, c) ≤ r. These points correspond to a subset U ⊆ V of k nodes such that max_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r. Thus, I_S has a positive answer. □

Due to Lemma 1, we obtain Theorem 2 directly. The NP-hardness of the d_ESR-based variant follows from a reduction similar to that of Lemma 1. The main changes are that we reduce from the decision version of the k-Center problem in which d is a dissimilarity function (not necessarily a metric) [21] and that we use d_ESR instead of d_ESP.

Sequence graph 𝑘-Median (SGM)
The Sequence Graph k-Median (SGM) problem requires finding k nodes, referred to as representatives, such that the sum of the distances between each node and its closest representative with respect to d_ESP is minimized.

Then, a clustering {C1, …, Ck} is obtained from a solution U as in the SGC problem. We prove that SGM is NP-hard by reducing the decision version of the metric k-Median problem, which is NP-complete [23], to the decision version of SGM. The decision version of SGM asks whether there exists a subset U ⊆ V of k nodes such that ∑_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r, for a given real number r.
Problem 5 (Metric k-Median (Decision Version) [23]). Given a metric space (X, d), where X is a set of n points and d : X × X → R⁺ is a distance function, an integer k ∈ [1, n], and a real number r, decide whether there exists a subset C ⊆ X of k points such that ∑_{p∈X} min_{c∈C} d(p, c) ≤ r.

For brevity, we denote max_{(p,p′)∈X×X} d(p, p′) + 1 by M. The division by M ensures that a value in [0, 1] is assigned to d_ESP and is needed because d_ESP takes values in [0, 1], whereas d takes values in R⁺. Clearly, M can be computed in polynomial time. Last, we set r in I_S2 to r/M. This completes the construction. In the following, we prove that I_S2 has a positive answer if and only if I_M has a positive answer. (⇒) If I_S2 has a positive answer, then there is a subset U ⊆ V of k nodes such that ∑_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r/M. These nodes correspond to a subset C ⊆ X of k points such that ∑_{p∈X} min_{p′∈C} d(p, p′) ≤ r, since d_ESP(u, v) = d(p, p′)/M for every pair of nodes (u, v) corresponding to a pair of points (p, p′). Thus, I_M has a positive answer. (⇐) If I_M has a positive answer, then there is a subset C ⊆ X of k points such that ∑_{p∈X} min_{p′∈C} d(p, p′) ≤ r. These points correspond to a subset U ⊆ V of k nodes such that ∑_{v∈V} min_{u∈U} d_ESP(v, u) ≤ r/M, since d(p, p′) = d_ESP(u, v) · M for every pair of points (p, p′) corresponding to a pair of nodes (u, v). Thus, I_S2 has a positive answer. □

Algorithms for clustering sequence graphs
We present algorithms for the SGC and SGM problems. These algorithms are presented with d_ESP, but they can also use d_ESR, or the proxies of these measures, instead (see Section 7).

Approximation algorithm for SGC
We begin by showing a polynomial-time approximation-preserving reduction from SGC to (the optimization version of) metric k-Center [22], defined below. This allows approximating SGC within a factor of 2.

Problem (Metric k-Center [22]). Given a metric space (X, d), where X is a set of n points and d : X × X → R⁺ is a distance function, and an integer k ∈ [1, n], find a subset C ⊆ X of k points such that max_{p∈X} min_{c∈C} d(p, c) is minimized.

Lemma 4. SGC can be reduced to metric 𝑘-Center.
Proof.Given any instance   of SGC, we construct an instance   of metric -Center in polynomial time, as follows: (I) We construct a set  of points by adding into an initially empty set  a point , for each node  ∈  in the graph of   .(II) We set  and  in   to  and  in   , respectively.(III) We set (,  ′ ) =   (( (),  ()), (, )), for every pair of nodes (, ) ∈  corresponding to a pair of points (,  ′ ) ∈  .This completes the construction of   .
In the following, we prove the correspondence between a solution   to   and a solution   to   .
( Proof.The statement follows from Lemma 4.
We also show that SGC cannot be approximated within a factor of 2 − ε, for any ε > 0, by reducing metric k-Center to SGC.
Theorem 5. SGC cannot be approximated within a factor of 2 − ε, for any ε > 0.
Proof. Metric k-Center cannot be approximated within a factor of 2 − ε for any ε > 0 [22]. Thus, it suffices to reduce metric k-Center to SGC. Given any instance I_C of metric k-Center, we construct an instance I_S of SGC in polynomial time, as follows. First, we create a sequence graph G = (V, E, S, σ) as follows: (I) We add a node u into V for each point p ∈ X, where X is the set of points in I_C. (II) We set E = ∅. (III) We set S = {ε}. (IV) We define σ such that σ(u) = ε for each u ∈ V. We also set k in I_S to k in I_C, and d_ESP(u, v) = d(p, p′) for every pair of nodes (u, v) corresponding to a pair of points (p, p′). This completes the construction of I_S.
The correspondence between a solution S_C to I_C and a solution S_S to I_S holds due to Lemma 4. □

Therefore, we develop SGC-APPROX, a 2-approximation algorithm for SGC based on the algorithm of Gonzalez [51] for metric k-Center. Note that, by Theorem 5, it is not possible to design a polynomial-time approximation algorithm for SGC with a better approximation ratio than that of SGC-APPROX. Our algorithm works as follows (see Algorithm 1 for the pseudocode). It adds an arbitrary node u into an initially empty set of clusters C and into an auxiliary set T that will contain the selected centers (Lines 1 to 3). Then, it performs k − 1 iterations (Lines 4 to 8). In each iteration, it finds a node that is as far as possible from its closest node in T with respect to d_ESP. This node is selected as a center and is added into T and into C. After that, SGC-APPROX constructs and returns a clustering comprised of each center and its closest nodes with respect to d_ESP, breaking ties arbitrarily (Lines 9 to 12).
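The farthest-point strategy of SGC-APPROX can be sketched as follows, with the distance measure abstracted into a callable (this is a generic Gonzalez-style sketch, not the paper's exact pseudocode):

```python
def sgc_approx(nodes, dist, k):
    # Gonzalez-style farthest-point traversal: pick an arbitrary first center,
    # then repeatedly pick the node farthest from its closest chosen center.
    centers = [nodes[0]]
    while len(centers) < k:
        farthest = max(nodes, key=lambda u: min(dist(u, c) for c in centers))
        centers.append(farthest)
    # Assign each node to its closest center (ties broken arbitrarily).
    clusters = {c: [] for c in centers}
    for u in nodes:
        clusters[min(centers, key=lambda c: dist(u, c))].append(u)
    return clusters
```

For instance, with `nodes = [0, 1, 10, 11]`, `dist = lambda a, b: abs(a - b)`, and `k = 2`, the traversal picks 0 and 11 as centers and yields the clusters {0, 1} and {10, 11}.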
The approximation guarantee of SGC-APPROX follows from the fact that the algorithm of Gonzalez [51] has an approximation factor of 2 for the metric k-Center problem, combined with Theorem 4.

Heuristic for SGM
We propose SGM-HEUR (see Algorithm 2 for the pseudocode), a heuristic based on an efficient and effective version of the k-medoids algorithm [54]. Our heuristic selects the k nodes with the smallest score ∑_{v∈V} d_ESP(u, v) / ∑_{v′∈V} d_ESP(v, v′), treats each such node as a cluster representative (medoid), and adds every other node into the cluster of its closest representative (Lines 1 to 8). After that, it updates the representatives and the clusters, as k-medoids does, until the clusters do not change or I iterations have been performed (Lines 9 to 21). Last, it returns the set of clusters (Line 22).
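A sketch of such a medoid-based heuristic is shown below; the initialization score follows the normalized distance-sum idea described above, and the update loop is standard k-medoids (the names and details are illustrative, not the paper's exact Algorithm 2):

```python
def sgm_heur(nodes, dist, k, max_iters=300):
    # Initialization: pick the k nodes with the smallest normalized
    # distance-sum score, i.e., the most centrally located nodes.
    denom = {v: sum(dist(w, v) for w in nodes) or 1.0 for v in nodes}
    reps = sorted(nodes, key=lambda u: sum(dist(u, v) / denom[v] for v in nodes))[:k]
    for _ in range(max_iters):
        # Assign every node to its closest representative.
        clusters = {r: [] for r in reps}
        for u in nodes:
            clusters[min(reps, key=lambda r: dist(u, r))].append(u)
        # Re-pick each cluster's medoid: the member minimizing in-cluster cost.
        new_reps = [min(c, key=lambda m: sum(dist(m, u) for u in c))
                    for c in clusters.values()]
        if set(new_reps) == set(reps):
            break  # clusters are stable
        reps = new_reps
    return clusters
```

On well-separated data such as `[0, 1, 2, 10, 11, 12]` with absolute difference as the distance and `k = 2`, the loop converges in a couple of iterations to the two natural groups.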

Experimental evaluation
In this section, we evaluate our algorithms with respect to effectiveness and efficiency. We also demonstrate that ESR and ÊSR, and hence d_ESR and d_ÊSR, converge fast. Evaluated Methods. We tested SGC-APPROX and SGM-HEUR with d_ESP, d_ÊSP, d_ESR, or d_ÊSR. We report results for SGC-APPROX with d_ESP (referred to as CA) and with d_ÊSP (referred to as CA_P), as well as for SGM-HEUR with d_ESP (referred to as MH) and with d_ÊSP (referred to as MH_P). The use of d_ESR and d_ÊSR in SGC-APPROX did not substantially help quality but reduced efficiency, while their use in SGM-HEUR did not substantially help efficiency but reduced quality; thus, we omit these results. In all our algorithms, we used a normalized version of d_ESP, where each of d_E and d_SP is divided by its maximum value, and a similarly normalized version of d_ÊSP. We compared our algorithms against four state-of-the-art attributed graph clustering methods which employ different techniques (see Section 2): (I) Text-Associated DeepWalk (TADW) [12], (II) Text Enhanced Network Embedding (TENE) [16], (III) Binarized Attributed Network Embedding (BANE) [17], and (IV) Adaptive Graph Convolution (AGC) [18].
To use these methods, we first constructed the set A = ∪_{T∈S} A_T, where A_T is the set of q-grams of a string T ∈ S, and then embedded the string σ(u) of each node u ∈ V into an attribute vector x_u such that x_u[i] is equal to: (I) 1, if σ(u) contains the q-gram with lexicographic rank³ i in A, and 0 otherwise; (II) the frequency in σ(u) of the q-gram with lexicographic rank i in A; or (III) the tf-idf score in S of the q-gram with lexicographic rank i in A. Note that such embeddings have already been used in the literature, for instance, in [10,16,33]. We report results for the best embedding method for each competitor and q. The real-valued embedding constructed by TADW or TENE was fed into k-means, following [55], while the binary embedding of BANE was fed into k-medoids [54] (with Hamming distance). We also implemented variants of CA and MH that represent a string using a vector of q-grams, as TADW and TENE do, reporting results for the best embedding method and q. We refer to these variants as CA_vec and MH_vec, respectively. Methods for clustering strings (k-medoids [54], k-means [1]) or graphs (spectral clustering [1]) performed worse than [12,16,17]; thus, we omit their results. We summarize all tested methods in Table 1, for ease of reference. Datasets and Setup. We used the Ciao (CIAO) and Epinions (EPIN) datasets from https://www.cse.msu.edu/~tangjili/datasetcode/truststudy.htm. In these datasets, each user (node) is associated with a (potentially empty) sequence of reviewed products (string), and an edge connects users who trust each other. Table 2 summarizes the characteristics of the datasets we used.
Clustering quality was quantified using: (I) the Average Sum of Pairwise Edit distances (ASPE), defined as the average, over all clusters, of the sum of pairwise edit distances between the strings of the nodes in the same cluster; and (II) Modularity [2], a well-established measure of network (graph) clustering quality, expressed as the fraction of the edges that fall within the given clusters minus the expected fraction if edges were distributed at random. Small ASPE and large Modularity values are preferred. The speed of convergence of ESR (see Theorem 1) was measured using the Average Relative Difference (ARD) [45], defined as the average, over all node pairs, of the relative difference between the scores of two consecutive iterations k and k + 1 of the fixed-point iteration algorithm. ARD was also used with ÊSR (defined as ÊSR(u, v) = 1 − d_ÊSR(u, v)) and with SimRank. By default, we used k = 15, q = 2, ε = 10⁻⁹, and 5 iterations for the fixed-point computation. We also used the default value I = 300 [54] and the default parameter values for the competitors. The proxy measures were implemented following [56]. All results are averaged over 10 runs. All algorithms were implemented in Python and executed on an Intel i9 at 3.70 GHz with 64 GB RAM. Our source code is available at https://rebrand.ly/SGcode. String vs. Vector Representation. We show that representing strings as attribute vectors has a negative impact on clustering quality by comparing our algorithms to the variants CA_vec and MH_vec. As can be seen in Figs. 2 and 3, CA_vec resulted in higher (worse) ASPE than both CA and CA_P. For example, the average ASPE for CA_vec was higher than that of CA (respectively, CA_P) by 104% (respectively, 56%) on average (over the two datasets). Also, CA_vec resulted in lower (worse) Modularity than both CA and CA_P. For example, the average Modularity for CA_vec was lower than that of CA (respectively, CA_P) by 18% (respectively, 33%) (over the two datasets). These results show the benefit of measuring similarity by using strings directly, as our algorithms do. Since CA_vec and MH_vec performed worse than CA and MH, we do not report results for them in the remainder of this section.
Clustering Quality. We show that our algorithms created clusters containing nodes with similar strings, as their ASPE is lower than that of most competitors (see Fig. 4(a)), and that these nodes are also structurally similar, as their Modularity is higher than that of most competitors (see Fig. 4(b)). In addition, the use of the proxy measures by our algorithms did not substantially affect clustering quality, as CA_P and MH_P performed very similarly to CA and MH, respectively (see Fig. A.8 in Appendix A). On the other hand, the competitors achieved a worse clustering than our algorithms when considering both ASPE and Modularity together, and some were worse in both of these measures. For example, TENE performed poorly in terms of both ASPE and Modularity, while BANE and TADW performed the worst in terms of Modularity and worse than MH and MH_P in terms of ASPE. AGC performed the worst in terms of ASPE but the best in terms of Modularity for k < 36. For k ≥ 36, AGC did not terminate, because the eigenvalue decomposition it employs to find the convolution failed. Similar results were obtained for the EPIN dataset (see Figs. 4(c) and 4(d)).
The reason that AGC performed poorly with respect to ASPE is that it favors Modularity by design, since it assumes that nodes that are close in the graph will likely be clustered together, as mentioned in Section 2. This assumption is not necessarily true. In fact, in our setting, AGC created clusters comprised of nodes that are close in the graph but have quite dissimilar strings, and this led to poor clusters in terms of ASPE. The reason that TADW performed poorly with respect to Modularity is that it supervises random walks based on attribute vectors, which resulted in clusters with nodes that are far apart in the graph. The reason that BANE (respectively, TENE) performed poorly in terms of Modularity is that it does not use proximities of order higher than first (respectively, second) to capture the distance of nodes in the graph. However, such proximities are important to consider [42], because good clusters may be constructed based on proximities of order higher than second. The good performance of our methods is due to three factors: (I) d_ESP and d_ESR can capture graph distance and string distance in a unified manner. (II) Unlike the competitors, our methods use the strings directly in similarity measurements instead of representing strings as attribute vectors, which may lose similarity information. (III) Unlike TENE and BANE, which only use first-order or first- and second-order proximities, our methods employ measures that capture the distance between two nodes based also on longer paths.
Note that the values of ASPE and Modularity depend on the similarity between strings and the similarity between nodes, respectively. Thus, it may not be possible to create a clustering with both low ASPE and high Modularity, when close nodes have different strings and vice versa.
Besides, we also examined the impact of the length of the q-grams used in the vector representation of a string by the competitors. Figs. 5(c)-(d) show that cluster quality in EPIN became worse as the q-gram length increased. This is because the number of distinct q-grams increases, and thus ASPE increased as well. Thus, the default q-gram length of 2 that we used is a fair choice. The results for CIAO were similar (see Figs. 5(a)-(b)).
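As a minimal illustration of this representation (the function name is ours, not the paper's), the sketch below counts the q-grams of a string; over an alphabet of size σ there are up to σ^q distinct q-grams, so the dimensionality of the dense vector grows exponentially with q, which matches the observed quality drop.

```python
from collections import Counter

def qgram_vector(s, q):
    # Sparse count vector over the q-grams (length-q substrings) of s.
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))
```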
Convergence. We show that ESR and its proxy converge faster than SimRank, which helps efficiency (see Fig. 6). This is attributed to the impact of string similarity (e.g., a string similarity of 0 between two nodes forces their score in the next iteration of Eq. (3) to be 0). In fact, the ARD scores for ESR and its proxy were smaller than 10^{-3} after 5 iterations, while those for SimRank were an order of magnitude larger even after 20 iterations.
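The effect can be seen in a toy string-gated SimRank iteration (a sketch under our own assumptions, not the paper's exact ESR recurrence): pairs with zero string similarity are pinned to score 0, which shrinks the set of pairs whose scores still change and thus speeds up convergence. Here `adj` maps each node to its in-neighbours.

```python
def gated_simrank(adj, strsim, c=0.8, iters=20, tol=1e-3):
    # Fixed-point iteration of a SimRank-style recurrence in which
    # pairs with zero string similarity are pinned to score 0.
    nodes = sorted(adj)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        delta, nxt = 0.0, {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif strsim(u, v) == 0.0 or not adj[u] or not adj[v]:
                    nxt[(u, v)] = 0.0  # gated, or no in-neighbours
                else:
                    total = sum(s[(a, b)] for a in adj[u] for b in adj[v])
                    nxt[(u, v)] = c * total / (len(adj[u]) * len(adj[v]))
                delta = max(delta, abs(nxt[(u, v)] - s[(u, v)]))
        s = nxt
        if delta < tol:  # maximum score change as a convergence criterion
            break
    return s
```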
Runtime. We examined the runtime of all methods for a varying number of nodes (see Fig. 7). The proxy-based variants of CA and MH were much more efficient than all competitors. For instance, the proxy-based variant of CA was up to 8 and 3 times faster than TENE, the fastest competitor, in the experiments of Figs. 7(a) and 7(b), respectively. As expected, the proxy-based variants were faster than CA and MH, since the latter two algorithms need to compute edit distance instead of the more efficiently computable proxy measures. For example, the proxy-based variant of MH was two orders of magnitude faster than MH in the case of clustering CIAO, and even faster in the case of clustering EPIN. In addition, the proxy-based variant of CA was approximately two times faster than CA. The impact of using the proxy measure in the algorithms for SGC was less significant compared to the algorithms for SGM, because the former perform fewer edit distance computations, as in SGC there is no need to compute all pairwise distances between strings. CA was faster than MH, as expected from the complexity analysis (see Section 6). For example, CA was more than 50 times faster than MH in the case of the CIAO dataset, and more than two orders of magnitude faster in the case of the EPIN dataset.

Case study: Clustering phylogenetic trees
As discussed in the Introduction, an edge in a sequence graph can model a relationship between users or a relationship between strings. In Section 7, we demonstrated the effectiveness of our approach in applications where edges represent relationships between users and the input data are modeled as a graph. We now proceed to a case study that highlights the effectiveness of our approach when edges represent relationships between strings and the input data are modeled as a phylogenetic tree [24].
In particular, we consider the domain of bioinformatics and the application of evaluating the quality of a phylogenetic tree [24]. A phylogenetic tree is a rooted or unrooted leaf-labeled bifurcating (binary) tree that represents evolutionary relationships among biological organisms [24]. A phylogenetic tree can be inferred from a set of strings, each representing the genomic sequence of an organism. Each leaf of the tree corresponds to a different organism and is labeled with a string representing the genomic sequence of that organism, while each non-leaf node v corresponds to a cluster comprised of all strings associated with the leaves of the subtree rooted at v. Thus, a phylogenetic tree is a hierarchical representation of all clusterings of the strings of its leaves.
A phylogenetic tree  whose leaves correspond to a set of strings  can be constructed by different methods (e.g., by agglomerative hierarchical clustering methods [24]).To evaluate the quality of  , one can compare it with a ground truth clustering  of .Let  be the number of clusters in .Clearly,  cannot be compared with  directly, since the former is a binary tree (i.e., a 2D structure), whereas the latter is a partition of  into  clusters (i.e., a 1D structure).Therefore, one needs to first ''flatten''  , by creating a clustering  ′ of its leaves that has  clusters, and then compare  ′ with .If these two clusterings are similar,  is of high quality, as it accurately reflects the evolutionary relationships between the organisms corresponding to the strings of  according to the ground truth clustering.
It is easy to see that T can be modeled as a sequence graph G = (V, E, S, f), with V (respectively, E) being the set of nodes (respectively, edges) of T, S being the set of strings corresponding to the leaves of T (i.e., the leaf labels), and f being a function that associates each leaf of T with its corresponding string in S and each non-leaf node of T with the empty string. Thus, we can construct C′ by first clustering the sequence graph corresponding to T with k equal to the number of clusters in the ground truth clustering, and then creating, for each resultant cluster, a cluster in C′ that is comprised of the non-empty strings corresponding to the nodes of the resultant cluster. After that, we can compare C′ with the ground truth clustering C using measures that compare clusterings (e.g., the measures in [57][58][59]).
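The modeling and flattening steps above can be sketched as follows; the tree is given as a child-to-parent map, and both function names are our own, not the paper's.

```python
def tree_to_sequence_graph(parent, leaf_strings):
    # parent: child -> parent map of the tree.
    # leaf_strings: leaf -> genomic sequence; internal nodes get "".
    nodes = set(parent) | set(parent.values())
    edges = [(child, p) for child, p in parent.items()]
    label = {v: leaf_strings.get(v, "") for v in nodes}
    return nodes, edges, label

def flatten_clustering(node_clusters, label):
    # Keep only the non-empty strings of each cluster of nodes,
    # yielding a clustering of the leaf strings.
    return [{label[v] for v in cluster if label[v]} for cluster in node_clusters]
```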
In what follows, we first discuss the data and setup we used and then the results of our case study.
Data Setup. We used three datasets, referred to as Ebolavirus (EBOL), Influenza (INFL), and Coronavirus (COR). The characteristics of these datasets are summarized in Table 3. In these datasets, each record is a genomic sequence of a different virus type (e.g., in EBOL, there are 59 records and each record corresponds to a different type of Ebolavirus). All genomic sequences were downloaded from the NCBI GenBank [60], based on their accession numbers provided in [61].
For each dataset, we obtained the phylogenetic tree from [62] and the ground truth clustering from the NCBI GenBank [60], using the BioPython library [63]. Specifically, each cluster in the ground truth clustering is comprised of all sequences with the same value in the Organism field, for EBOL and INFL, or of all sequences with the same value in the last element of the Taxonomy field, for COR (since all sequences in this dataset had the same value in Organism). It can be readily verified from Figs. B.9, B.10, and B.11 in Appendix B that the phylogenetic trees we used are in accordance with the ground truth clustering. That is, each ground truth cluster contains leaves that are close together in the phylogenetic tree. We constructed a clustering C′ from the sequence graph corresponding to a phylogenetic tree T by applying one of our methods (CA, MH, or their proxy-based variants) or a competitor (TADW, TENE, BANE, AGC), configured as in Section 7.
To measure the similarity between C′ and the ground truth clustering, we used three well-established measures that compare two clusterings based on their labels: Clustering Accuracy (ACC) [57], Normalized Mutual Information (NMI) [58], and macro-F1 score [59]. These measures take values in [0, 1], with larger values indicating a more accurate (i.e., closer to the ground truth) clustering.
ACC is computed based on Eq. (16):

$\mathrm{ACC} = \frac{1}{|S|} \sum_{i=1}^{|S|} \delta\big(y_i = \mathrm{map}(c_i)\big)$, (16)

where S is the set of strings in the input sequence graph, $y_i$ is the ground truth label of the i-th string in the input sequence graph, $c_i$ is the id of the cluster of C′ where this string belongs (used as its clustering label), $\mathrm{map}(\cdot)$ is the optimal mapping function that permutes clustering labels to match the ground truth labels, and $\delta(\cdot)$ outputs 1 if its argument is true and 0 otherwise. NMI is computed based on Eq. (17):

$\mathrm{NMI} = \frac{\sum_{i,j} n_{ij} \log\frac{|S| \cdot n_{ij}}{n_i \cdot n_j}}{\sqrt{\big(\sum_i n_i \log\frac{n_i}{|S|}\big)\big(\sum_j n_j \log\frac{n_j}{|S|}\big)}}$, (17)

where $n_i$ denotes the number of strings in the i-th cluster of C′, $n_j$ denotes the number of strings belonging to the j-th cluster of the ground truth clustering C, $n_{ij}$ is the number of strings belonging both to the i-th cluster of C′ and to the j-th ground truth cluster, and the sums range over the k clusters of each clustering.
The macro-F1 measure is based on the F1 measure. F1 assumes a setting where there are only two different labels, namely 0 and 1, in the ground truth clustering, and it is computed based on Eq. (18):

$F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$, (18)

where $\mathrm{Prec} = \frac{TP}{TP + FP}$ and $\mathrm{Rec} = \frac{TP}{TP + FN}$. In turn, TP denotes the number of strings with label 1 in both the ground truth clustering and C′, while FN (respectively, FP) denotes the number of strings with label 1 (respectively, 0) in the ground truth clustering and 0 (respectively, 1) in C′. When the ground truth clustering contains k labels, the macro-F1 measure is defined based on Eq. (19):

$\text{macro-}F_1 = \frac{1}{k} \sum_{i=1}^{k} F_1(i)$, (19)

where $F_1(i)$ is the F1 score obtained for a two-label setting, in which 1 is the label of cluster i and 0 is the label of any other cluster. We used the default parameters of Section 7 for all methods. All experiments ran on the PC mentioned in Section 7.
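The three measures can be sketched directly from their definitions; the brute-force label permutation in `acc` stands in for the optimal mapping function (fine for small k), and all function names are ours.

```python
from math import log
from itertools import permutations
from collections import Counter

def acc(truth, pred):
    # Clustering Accuracy: best permutation of predicted labels
    # onto ground-truth labels (brute force, suitable for small k).
    labels = sorted(set(pred))
    best = 0
    for perm in permutations(sorted(set(truth)), len(labels)):
        m = dict(zip(labels, perm))
        best = max(best, sum(t == m[p] for t, p in zip(truth, pred)))
    return best / len(truth)

def nmi(truth, pred):
    # Normalized Mutual Information from the contingency counts.
    n = len(truth)
    nij = Counter(zip(pred, truth))
    ni, nj = Counter(pred), Counter(truth)
    mi = sum(c / n * log(n * c / (ni[i] * nj[j])) for (i, j), c in nij.items())
    hi = -sum(c / n * log(c / n) for c in ni.values())
    hj = -sum(c / n * log(c / n) for c in nj.values())
    return mi / (hi * hj) ** 0.5

def macro_f1(truth, pred, mapping):
    # mapping: predicted cluster id -> ground-truth label
    # (e.g., the permutation found by acc). One-vs-rest F1, averaged.
    scores = []
    for lab in sorted(set(truth)):
        tp = sum(t == lab and mapping[p] == lab for t, p in zip(truth, pred))
        fp = sum(t != lab and mapping[p] == lab for t, p in zip(truth, pred))
        fn = sum(t == lab and mapping[p] != lab for t, p in zip(truth, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```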
Clustering Quality. Since the phylogenetic trees we used are in accordance with the ground truth (see Figs. B.9, B.10, and B.11 in Appendix B), we expect that a good clustering method for evaluating a phylogenetic tree would have a high value (close to 1) in ACC, NMI, and macro-F1. The higher the value, the better the clustering method, as the clustering it constructs is more similar to the ground truth clustering. Tables 4, 5, and 6 show that the clusterings created by our methods are substantially more similar to the ground truth clustering than those created by the competitors. CA was the best performing method, outperforming MH in all tested cases, due to its objective function. Specifically, its ACC, NMI, and macro-F1 scores were 15.6%, 15.2%, and 16% larger on average (over all datasets) than those of MH, respectively. In addition, the use of proxy measures in our methods did not substantially affect clustering quality. This is encouraging, as the proxy-based variants of CA and MH are more efficient, as discussed in Section 7.
On the other hand, the competitors did not perform well. For example, the ACC, NMI, and macro-F1 scores for CA were 91.8%, 159.5%, and 121.2% larger on average (over all datasets) than those of the best competitor, TADW. The main reason for the poor performance of the competitors is that, in the application we consider, only the leaves of a phylogenetic tree have non-empty strings associated with them, while a large number of non-leaf nodes are associated with the empty string. This leads the competitors to construct clusters with leaf nodes associated with different strings, as their assumptions (close nodes in the tree should be clustered together, for AGC, and low order proximities are sufficient to cluster nodes in the tree, for BANE and TENE) are invalidated. Last, note that in the case of EBOL and COR, BANE did not produce k clusters, since it learned fewer than k binary code representations.

Conclusion
This work introduced the problem of clustering sequence graphs and studied variants of the problem based on the k-center and k-median problems. We first proposed a product metric and a SimRank-based measure to capture the distance between two nodes of a sequence graph, as well as a proxy for each measure. We then proposed an approximation algorithm and a heuristic, which generally outperform attributed graph clustering methods, as shown experimentally. Last, we proposed a methodology that successfully applies our measures (and the corresponding clustering algorithms) to evaluate whether a given phylogenetic tree is in accordance with a given ground truth clustering.

Theorem 2 .
SGC is NP-hard.

Proof. The statement follows from Lemma 1. □

We also consider the following variant of SGC, which uses the proxy measure instead of the original distance measure:

Problem 3 (Sequence Graph k-Center (SGC) with the proxy measure). Given a sequence graph G = (V, E, S, f) and an integer k ∈ [1, |V|], find a subset C ⊆ V of k nodes such that $\max_{v \in V} \min_{c \in C} \hat{d}(v, c)$ is minimized, where $\hat{d}$ denotes the proxy measure.

Lemma 3 .
The decision version of Problem 6 is NP-complete.

Proof. Given any instance $I_M$ of the decision version of k-Median, we construct an instance $I_2$ of the decision version of Problem 6 in polynomial time, by creating a sequence graph G = (V, E, S, f) as follows: (I) We add a node $u_p$ into V, for each point p ∈ P, where P is the set of points in the decision version of k-Median. (II) We set E = ∅. (III) We set S = {ε}. (IV) We define f such that $f(u_p) = ε$ for each $u_p$ ∈ V. We also set k in $I_2$ to k in $I_M$, and set $\hat{d}(u_p, u_q) = d(p, q) / (\max_{(p', q') \in P \times P} d(p', q') + 1)$, for every pair $(u_p, u_q)$ ∈ V × V corresponding to pair (p, q) ∈ P × P, where $d : P \times P \to \mathbb{R}^+$ is the dissimilarity function in the decision version of k-Median.
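The construction in this proof can be sketched programmatically (all names are ours; `dissim` plays the role of the k-Median dissimilarity function). Dividing by the maximum pairwise dissimilarity plus 1 normalizes the proxy distances into [0, 1).

```python
def kmedian_to_proxy_instance(points, dissim, k):
    # Reduction sketch: one node per point, no edges, the empty string
    # on every node, and proxy distances normalized into [0, 1) by
    # dividing by (maximum pairwise dissimilarity + 1).
    nodes = [f"u{i}" for i in range(len(points))]
    dmax = max(dissim(p, q) for p in points for q in points)
    proxy = {(nodes[i], nodes[j]): dissim(points[i], points[j]) / (dmax + 1)
             for i in range(len(points)) for j in range(len(points))}
    return nodes, [], {u: "" for u in nodes}, proxy, k
```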

Fig. 4 .
Fig. 4. Comparison of our algorithms against state-of-the-art algorithms for attributed graph clustering: (a) ASPE and (b) Modularity vs. k for CIAO; (c) ASPE and (d) Modularity vs. k for EPIN.

Fig. 6 .
Fig. 6. Convergence speed for our measures and SimRank: ARD vs. the number of iterations of the fixed-point iteration algorithm for: (a) CIAO and (b) EPIN.

Fig. 7 .
Fig. 7. Efficiency of our methods and the competitors: Runtime vs. % of nodes for: (a) CIAO and (b) EPIN.

Fig. B. 10 .
Fig. B.10. Phylogenetic tree of EBOL with the ground truth. The sequences are shown as leaves. Each cluster in the ground truth clustering is represented with a differently colored rectangle.
an integer k ∈ [1, |P|], and a real number τ, decide whether there exists a subset C ⊆ P such that $\sum_{p \in P} \min_{c \in C} d(p, c) \leq \tau$.

Lemma 2. The decision version of SGM is NP-complete.

Proof. Given any instance $I_M$ of the decision version of k-Median, we construct an instance $I_S$ of the decision version of SGM in polynomial time, by creating a sequence graph G = (V, E, S, f) as follows: (I) We add a node $u_p$ into V, for each point p ∈ P, where P is the set of points in $I_M$. (II) We set E = ∅. (III) We set S = {ε}. (IV) We define f such that $f(u_p) = ε$ for each $u_p$ ∈ V. We also set k in $I_S$ to k in $I_M$, and the distance between $u_p$ and $u_q$ to d(p, q), for every pair $(u_p, u_q)$ ∈ V × V corresponding to pair (p, q) ∈ P × P. Last, we set τ in $I_S$ to τ in $I_M$. This completes the construction. In the following, we prove that $I_S$ has a positive answer if and only if $I_M$ has a positive answer.

(⇒) If $I_S$ has a positive answer, then there is a subset C ⊆ V of k nodes such that $\sum_{v \in V} \min_{c \in C} d(v, c) \leq \tau$. These nodes correspond to a subset $C_P$ ⊆ P of k points such that $\sum_{p \in P} \min_{c \in C_P} d(p, c) \leq \tau$. Thus, $I_M$ has a positive answer.

(⇐) If $I_M$ has a positive answer, then there is a subset $C_P$ ⊆ P of k points such that $\sum_{p \in P} \min_{c \in C_P} d(p, c) \leq \tau$. These points correspond to a subset C ⊆ V of k nodes such that $\sum_{v \in V} \min_{c \in C} d(v, c) \leq \tau$. Thus, $I_S$ has a positive answer. □

Due to Lemma 2, we obtain Theorem 3 directly: the statement follows from Lemma 2.

We also consider a variant of SGM which uses the proxy measure instead:

Problem 6 (Sequence Graph k-Median (SGM) with the proxy measure). Given a sequence graph G = (V, E, S, f) and an integer k ∈ [1, |V|], find a set C ⊆ V of k nodes such that $\sum_{v \in V} \min_{c \in C} \hat{d}(v, c)$ is minimized, where $\hat{d}$ denotes the proxy measure. The decision version of Problem 6 asks whether there exists a subset C ⊆ V of k nodes such that $\sum_{v \in V} \min_{c \in C} \hat{d}(v, c) \leq \tau$, for a given real number τ. Below, we provide a reduction from the decision version of k-Median [21], which implies that Problem 6 is NP-hard.

(⇒) If $C_P$ is a solution to the k-Center instance, then there is a subset $C_P$ ⊆ P of k points such that $\max_{p \in P} \min_{c \in C_P} d(p, c)$ is minimum. These points correspond to a subset C ⊆ V of k nodes such that $\max_{v \in V} \min_{c \in C} d(v, c)$ is minimum. Thus, C is a solution to the SGC instance. (⇐) If C is a solution to the SGC instance, then there is a subset C ⊆ V of k nodes such that $\max_{v \in V} \min_{c \in C} d(v, c)$ is minimum. These nodes correspond to a subset $C_P$ ⊆ P of k points such that $\max_{p \in P} \min_{c \in C_P} d(p, c)$ is minimum. Thus, $C_P$ is a solution to the k-Center instance. □

Theorem 4. SGC can be approximated within a factor of 2, for any ε > 0.

Table 1
Summary of the methods used in experiments.

Table 4
ACC for different methods applied with k = 5 on EBOL and INFL, and with k = 9 on COR. A × denotes that a method did not produce a score, because it did not produce k clusters. The values for the best performing method are in bold.

Table 5
NMI for different methods applied with k = 5 on EBOL and INFL, and with k = 9 on COR. A × denotes that a method did not produce a score, because it did not produce k clusters. The values for the best performing method are in bold.

Table 6
Macro-F1 for different methods applied with k = 5 on EBOL and INFL, and with k = 9 on COR. A × denotes that a method did not produce a score, because it did not produce k clusters. The values for the best performing method are in bold.