A topological characterization of DNA sequences based on chaos geometry and persistent homology

Methods for analyzing similarities among DNA sequences play a fundamental role in computational biology, and have a variety of applications in public health, and in the field of genetics. In this paper, a novel geometric and topological method for analyzing similarities among DNA sequences is developed, based on persistent homology from algebraic topology, in combination with chaos geometry in 4-dimensional space as a graphical representation of DNA sequences. Our topological framework for DNA similarity analysis is general, alignment-free, and can deal with DNA sequences of various lengths, while proving first-of-the-kind visualization features for visual inspection of DNA sequences directly, based on topological features of point clouds that represent DNA sequences. As an application, we test our methods on three datasets including genome sequences of different types of Hantavirus, Influenza A viruses, and Human Papillomavirus.


I. INTRODUCTION
The last few decades have witnessed a surge in the growth of methods that are devoted to analyzing the similarities among DNA sequences to obtain the corresponding genetic information.Despite these diverse methods, they can be classified into graphical representation of DNA sequences and other techniques based on numeric representations.Methods based on graphical representation of DNA sequences have contributed significantly to the general area of DNA similarity analysis.One of the main ideas used in graphical representation is to realize each nucleotide base, say A, C, G, T as a point in a Euclidean space, normally, R n for some small values n between 2 and 5.One then obtains a collection of points in R n that represents a DNA sequence.Based on these collections of points, many papers are devoted, in combination with other techniques, such as signal processing (see, for example, [1]), to compare similarities among DNA sequences.The most frequently used similarity measures for analyzing differences or similarities between DNA sequences based their corresponding graphical representation are Euclidean distances or correlation angles.For papers that adopt graphical representation for DNA similarity analysis, the reader is referred to, for example, references [2]- [21].
In addition to graphical representations, numeric representations are also used for DNA similarity analysis, which maps each DNA sequence to a sequence of numbers.In order to compare differences or similarities among DNA sequences, other ad hoc methods are utilized to compare their corresponding number sequences.The reader is referred to, for example, [22]- [29], and their references therein for these non-graphical representation based methods.
A common theme in graphical representation based methods is to rely on simply constructed geometric object, say curves in Euclidean spaces that represent DNA sequences to apply for DNA similarity analysis.Although there were apparent successes in using such methods, the employed techniques pose some technical difficulties: (i) curves constructed via graphical representations may not truly represent the geometry of the corresponding DNA sequences due to degeneracy or selfcrossings (ii) these methods may lead to poor performance due to not being able to effectively deal with short and long DNA sequences simultaneously.In an attempt to overcome these technical difficulties, our paper provides a completely novel method for DNA similarity analysis based on a combination of graphical representation with tools from algebraic topology in particular persistent homology (see [30] for this notion).Our method also employs a graphical representation by first transforming each DNA sequence into a collection of points in a Euclidean space R n which can be viewed as point clouds in topological data analysis.Instead of using simple geometrybased methods for analyzing these data points, we apply tools from algebraic topology such as persistent homology to obtain topological signatures of these point clouds that are signified via their persistence diagrams.Each DNA sequence, thus, is in correspondence with its unique persistence diagrams which encode its topological signatures such as how many connected components, or how many 1-dimensional holes are present in the topology of DNA sequences.Using the wellknown Wasserstein distance (which will be reviewed later) for such persistence diagrams, our key observation is that the similarities among DNA sequences are reflected by the Wasserstein distances among their corresponding persistence diagrams.One important feature of our method is its highly distinctive visuality of geometry and topology of DNA sequences.In other words, the DNA sequences, in many cases, can be immediately distinguished by visualization of their corresponding persistence diagrams which are collections of points in R 2 .Another outstanding feature in our method is that it can effectively deal with various lengths of DNA sequences.Our topological framework is general, alignment-free, and can deal with DNA sequences with only partial genome information while providing first-of-the-kind visualization features.
The rest of the paper is organized as follows.In Section II, we introduce a new higher dimensional (4-dimensional) representation of DNA sequence based on chaos geometry.This new 4-D representation will be used for building our topological representation of the DNA sequences.In Section III, we review some basic notions related to persistent homology and formally introduce our method in Section IV.We apply our method for analyzing three datasets in Section V, and compare with other state-of-the-art methods such as Clustal Omega (see [31]).

II. CHAOS 4-DIMENSIONAL REPRESENTATION
In this section, we employ 4-dimensional chaos (see [32]) to transform a DNA sequence into a finite set of points in R 4 that can be viewed as a 4-dimensional representation of the DNA sequence.To the best of our knowledge, this is the first time that chaos in higher dimensional space has been adopted to encode DNA sequences.Note that chaos game in 2-dimensional space was already used to represent DNA sequences in previous work (see, for example, [33]).More specifically, let α be a DNA sequence of length n of the form where Set e 1 = (1, 0, 0, 0), e 2 = (0, 1, 0, 0), e 3 = (0, 0, 1, 0), and e 4 = (0, 0, 0, 1), the four standard unit vectors in R 4 .Using the 4-dimensional chaos, we construct the finite set of points X α in R 4 , consisting of points b 1 , . . ., b n ∈ R 4 as follows.
(i) b 1 = e a1 ; and Intuitively, b k is the k-th point in the 3-simplex that is chosen to be the midpoint of the line segment that connects the (k − 1)-th point and the vertex e ak .The map that transforms each DNA sequence

DIAGRAMS
In this section, we recall the notion of persistent homology and persistence diagrams that we be used in our method for analyzing DNA sequences.One of the main references for persistent homology is [30] (see also [34] for a detailed account of persistent homology).

A. Homology groups of a simplicial complex
Let k be a nonnegative integer, and let u 0 , . . ., u k be k + 1 points in R k+1 .A k-simplex σ generated by {u 0 , . . ., u k } is the convex hull of {u 0 , . . ., u k }, i.e., the set consisting of all convex combinations of these points that is given by Throughout this paper, we denote by [u 0 , . . ., u k ] the ksimplex generated by the points u 0 , . . ., u k .
Intuitively, a 0-simplex is a point in R, a 1-simplex is a line in R 2 , and a 2-simplex is a triangle in R 3 (see Figure for illustration of these simplices).
The convex hull of any subset of {u 0 , . . ., u k } with d + 1 elements is also a d-simplex, and is called a face of σ.

Definition III.2. (simplicial complex)
A simplicial complex Δ is a collection of simplices such that whenever σ is a simplex in Δ, all the faces of σ are contained in Δ.
A key notion for constructing persistent homology of a set of points in R n is a formal sum of j-simplices.A formal sum of j-simplices is an object of the form h a h σ h , where the a h are real numbers in R such that all but finitely many a h are zero, and the σ h are j-simplices.Two formal sums h a h σ h and h b h σ h can be added in a natural way as Equipped with this natural addition "+", the collection of all formal sums h∈H a h σ h becomes a group-a mathematical structure in Algebra in which one can add, and subtract its elements in a similar way as the real numbers do.
Let Δ be a simplicial complex in R n .For j ≥ 0, the j-th chain group C j (Δ) is the group of all formal sums of jsimplices h a h σ h , where the σ h are j-simplices in Δ.Each formal sum in C j (Δ) is a j-chain.
There is a natural map ∂ j which can send an element in C j (Δ) to C j−1 (Δ) by removing one point from the generating set of a formal sum, and taking the alternating sum of them.It suffices to give the equation of ∂ j for each j-simplex [u 0 , . . ., u j ] in Δ that is given by Thus one obtains the map, called the j-th boundary map The j-th chain group C j (Δ) contains two important subgroups Z j (Δ) and B j (Δ).The former, Z j (Δ), called the jth cycle group, consists of all j-chains σ in C j (Δ) such that ∂ j (σ) = 0.For example, for any three distinct points u The latter, B j (Δ), called the j-th boundary group, consists of all j-chains ∂ j+1 (σ), where σ varies over the (j +1)-th chain group C j+1 (Δ).The fundamental theorem of homology implies that Z j (Δ) ⊂ B j (Δ).
The following is one of the most important notions that we will use in our method.

Definition III.3. (homology groups)
Let Δ be a simplicial complex in R n .For each j ≥ 0, the j-th homology group of Δ is the quotient group H j (Δ) = B j (Δ)/Z j (Δ).

B. Persistent Homology Groups and Persistence Diagrams
In this subsection, we recall the notion of persistent homology groups generated by a finite set of points in R n .Let X be a finite set of points, say u 1 , . . ., u d in R n , and let d denote the standard Euclidean distance in R n .For any α ≥ 0, let VR(X; α) be the collection of all subsets σ of X such that the Euclidean distance between any two elements in σ is at most α, that is, It is easy to verify that VR(X; α) is a simplicial complex called the α-Vietoris-Rips complex of X, and thus one can construct homology groups {H j (VR(X; α))} j≥0 as in Subsection III-A.Now take an increasing sequence of nonnegative real numbers, say {α i } i≥1 with α i < α i+1 for all i.Set α 0 = −∞.
and VR 0 = VR(X; α 0 ) = ∅.Then one obtains a filtration of the form where α s is large enough such that all pairs of points in X are within α s .For each 0 ≤ p ≤ q ≤ s, there is a natural map ∂ p,q j : H j (VR p ) → H j (VR q ) induced from the embedding VR p ⊆ VR q .We denote by Im(∂ p,q j ) the image of the map ∂ p,q j .Definition III.4.(persistent homology groups) For each j ≥ 0, the j-th persistent homology groups of X are the images H p,q j (X) = Im(∂ p,q j ) for 0 ≤ p ≤ q ≤ s.By a j-topological feature of X, we mean an element γ in H p,p j (X) for some 0 ≤ p ≤ s.The j-th persistence diagram of X is a set of points {b, d) |0 ≤ b < d}, where each point (b, d) signifies the birth and death times of a j-topological feature γ of X, i.e., b is the radius in which γ first appears in VR b and d is the radius in which γ gets filled in with a lower dimensional simplex.We denote by PD j (X) the jth persistence diagram of X.In our methods, it suffices to consider only the 0-th and 1-th persistence diagrams, which correspond to topological features of connectedness and 1dimensional holes of X, respectively.
Let X, Y be two finite sets of points in R n .In order to compare topological features of X, Y in our methods, we consider the Wasserstein distance of degree 1 between PD 1 (X) and PD 1 (Y ), i.e., where

IV. PROPOSED METHOD
Our proposed method for reconstructing a phylogenetic tree of DNA sequences is described in the following algorithm: (0) (Input) A collection of n DNA sequences α 1 , . . ., α n .
(1) Construct Chaos 4-dimensional Representation (C4DR) of each DNA sequence α i to obtain a finite set of points X αi in R 4 (see Section II). ( 2) Compute the 1st persistence diagrams of the X αi to obtain the sets PD 1 (X αi ) in R 2 , using the notions in Section III.We use Python packages from https://pypi.org/project/persim/ to compute persistence diagrams and Wasserstein distances.(3) Compute the distance matrix of dimensions n × n whose (i, j)-entry is the Wasserstein distance ) (Ouput) Construct the phylogenetic tree of the DNA sequences from the distance matrix in Step 3, using UPGMA algorithm (see [35]).
Remark IV.1.In the proposed method, and our experimental analysis, for Steps 2 and 3 above, we also try to use both the 0th and 1st persistence diagrams of X αi ), and hence use both Wasserstein distances W 0 for 0th persistence diagrams and W 1 for 1st persistence diagrams to construct the phylogenetic tree of the DNA sequences.The resulting phylogenetic trees are very similar to those constructed using only 1st persistence diagrams.Intuitively 0th persistence diagrams only provide information about how many connected components there are from the topological space induced from cloud points, and thus they do not provide a great deal about the topological structure of the cloud points as 1st persistence diagrams.So in the proposed method above, and in our experimental analysis that we will describe in next section, we only use 1st persistence diagrams and the Wasserstein distance W 1 to construct the phylogenetic trees in order to save computational time.

V. RESULTS
In this section, we apply our method described in Section IV to analyzing three datasets: Human Papillomavirus (HPV), Hantavirus, and Influenza A virus.All of the datasets are obtained from GenBank1 , which is the NIH genetic sequence database.In addition, the accession numbers of HPV and Influenza datasets can be found in [36], the accession number of Hantavirus dataset can be found in [37].

A. Human Papillomavirus (HPV)
We first consider the dataset of HPV.Human Papillomavirus is mostly responsible for cervical cancer which is the second most common cancer among women (see [38]).We apply our method on the dataset of 400 HPV genomes that consist of 12 genotypes 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66.Note that among these genotypes, there are low risk HPV types such as 6 and 11, and high risk HPV types such as 16 and 18.These high risk HPV types are responsible for about 70% of cervical cancer (see [39]).Thus it is an important problem to accurately classify HPV into low and high risk types.In addition, a reasonable method should be able to identify HPV genotypes when only partial genomes are available.Our proposed method in Section IV have all the above features, and is suitable for efficiently classifying HPV genotypes.In addition it allows one to visually inspect differences between HPV genotypes.For example, Fig. 1 illustrates identical persistence diagrams of subtypes 11 and 15 of the same HPV genotype 6 whose highly identical visualization shows that they should belong in the same HPV genotype.On the other hand, the distinctive visualization between subtype 44 of HPV genotype 16 and subtype 18 of HPV genotype 18 in Fig. 2 shows that they should belong in different genotypes of HPV.In fact the Wasserstein distance between them is 1.548-the maximum distance between any two genome sequences of HPV from the dataset.
From the phylogenetic tree of HPV genomes based on our method in Fig. 5, our method accurately classifies these HPV genomes into their corresponding genotypes.Note that in [1], their CGR method misplaced one genome of HPV type 11, and they further showed that Clustal Omega (see [31]) is able to classify the dataset of HPV correctly.

B. Hantavirus (HV)
Hantavirus, named after Hanta River in South Korea, is a recently discovered RNA virus in the family Bunyaviridae.HV may infect humans, and some strains of HV can cause possibly fatal diseases in humans such as Hantavirus hemorrhagic fever with renal syndrome (HFRS) or hantavirus hemorrhagic fever with renal syndrome (HPS).In Eastern Asia, the type of HV that causes HFRS, mainly include Hantaan (HTN) and Seoul (SEO) viruses.In Western European, Russia, and Northeastern China, Puumala (PUU) is the type of HV that causes HFRS.There is another genus of the family Bunyaviridae that is called Phlebovirus (PV).We apply our method on the data set of 34 HV genome sequences consisting of 4 different types HTN, SEO, PUU, and PV.The name of these strains are included in the phylogenetic tree (see Fig. 3).From Figure 3, we find that our method accurately cluster CGRN strains together whose host is Rattus norvegicus.Similarly the method correctly clusters four CGAa strains together whose host is Apodemus agrarius.In addition, two CGHu strains whose host is Homo sapiens, is also grouped together.
Except strain Sotkamo (type PUU) forms an independent leave, we find that all leaves are classified accurately into their corresponding types.But branches of the phylogenetic tree constructed by our method contain both types as sub-branches which are quite different from the results in [40]- [42].

C. Influenza
Influenza A viruses are very dangerous because they have a wide range of hosts including birds, horses, swine, and humans.These viruses have been a serious health threat to humans and animals (see [43]), and are known to have high degree of genetic and antigenic variability (see [44], [45]).Some subtypes of Influenza A viruses are very lethal including H1N1, H2N2, H5N1, H7N3, and H7N9.We apply our method on the dataset consisting of 38 nfluenza A virus genomes.From Fig. 4, we find that except A/emperor goose/Alaska/44297-260/2007 (H2N2) and A/turkey/VA/505477-18/2007 (H5N1), the tips (or leaves) of Fig. 3: Phylogenetic tree of Hantavirus genomes the phylogenetic tree of segment 6 of Influenza A virus genomes are clustered correctly into their types.However, for type H1N1 genomes, our method clusters them into 3 distinct subbranches of the tree.We will address this problem in an ongoing work.