Research on the Evolutionary Path of COVID-19

The World Health Organization reported that a novel coronavirus (COVID-19) was identified as the causative virus of the pneumonia of unknown etiology first reported by Chinese authorities on 7 January 2020. Our objectives were to discover the evolutionary path and evolutionary characteristics of COVID-19. Over time, the evolution of COVID-19 has shown a variety of trends. In this paper, we propose a method for constructing the evolutionary network of COVID-19 in chronological order, and we analyse the evolutionary tree and haplotypes of COVID-19.


Data
The first genome sequence of COVID-19 (MN908947.3) was published on January 10, 2020. All results are based on comparative analysis against MN908947.3.
We chose 22 genome sequences of COVID-19 that were isolated from humans, are of high quality, and have the same length as MN908947.3. Because of permission restrictions, we could obtain only these 22 genome sequences from GenBank and Genome Warehouse. The sequences came from China, Japan, Greece, the U.S.A., Brazil and some other countries. Table 1 gives detailed information on all 22 genome sequences. We construct the matrix starting from the first genome sequence and proceeding in chronological order. If the base at a given site changes over time, a mutation is considered to have occurred at that site, and the corresponding position of the matrix is set to 1. The more sites change in one genome sequence, the larger the corresponding matrix element value.
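One way to read the matrix construction described above is as a genome-by-site table of 0/1 flags, where the row sum gives the number of changed sites in a genome. The following is a minimal sketch under that assumption; the function name mutation_matrix and the toy sequences are hypothetical and are not the data of Table 1.

```python
def mutation_matrix(reference, genomes):
    """Return one row per genome; entry is 1 where the base differs from the reference."""
    matrix = []
    for seq in genomes:
        if len(seq) != len(reference):
            raise ValueError("all sequences must have the same length as the reference")
        matrix.append([int(ref_base != base) for ref_base, base in zip(reference, seq)])
    return matrix


# Toy sequences (hypothetical), listed in chronological order of sampling.
reference = "ACGTACGT"
genomes = ["ACGTACGT", "ACGAACGT", "TCGAACGT"]

matrix = mutation_matrix(reference, genomes)
# Number of changed sites per genome, i.e. the row sums of the matrix.
changed_per_genome = [sum(row) for row in matrix]
```

The length check reflects the selection criterion that every genome has the same length as the reference, so sites can be compared position by position without realignment.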

Evolutionary tree.
An evolutionary tree [9], also called a phylogenetic tree, is a tree that shows the evolutionary relationships between species that are considered to have common ancestors. It is used to represent the results of phylogenetic research and, in bioinformatics, to describe the relatedness of different organisms. Phylogenetic analysis can help people understand the evolution of all organisms. Many software packages can draw evolutionary trees, e.g. MEGA (Molecular Evolutionary Genetics Analysis) [10], PhyML (http://www.atgc-montpellier.fr/phyml/), FastTree (http://www.microbesonline.org/fasttree/) and BEAST (Bayesian evolutionary analysis by sampling trees) [11]. In this paper, the evolutionary tree of COVID-19 was drawn with MEGA and BEAST.

Haplotype analysis.
Haplotype is short for haploid genotype. A haplotype is the combination of a series of genetic mutations that coexist on a single chromosome; each chromosome has its own unique haplotype [12].
During meiosis, recombination between homologous non-sister chromatids can produce new haplotypes. Haplotypes contain a complete set of genetic information, which is the basis for describing an individual genome and an essential aspect of genome research.
The haplotype network map can reveal the genetic distance and evolutionary relationships between different virus haplotypes, and the root and leaf nodes indicate the direction of variation and evolution of the virus.
When the amount of data is large enough and the sampling is sufficiently random, a haplotype that contains a large number of viruses indicates a large population, suggesting that this type of virus spreads quickly in the population.
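To make the notion of haplotype size concrete, the sketch below groups aligned genomes by their bases at the variable sites and counts the genomes in each group. It is only an illustration: the helper names variable_sites and haplotype_counts and the toy sequences are assumptions, not part of the analysis reported here.

```python
from collections import Counter


def variable_sites(sequences):
    """Return 0-based positions at which the aligned sequences are not all identical."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotype_counts(sequences):
    """Count genomes per haplotype, taking a haplotype to be the bases at the variable sites."""
    sites = variable_sites(sequences)
    return Counter("".join(seq[i] for i in sites) for seq in sequences)


# Toy aligned sequences (hypothetical).
sequences = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "TCGAACGT", "ACGTACGT"]
counts = haplotype_counts(sequences)
largest_haplotype, largest_size = counts.most_common(1)[0]
```

In a haplotype network, each key of counts would be drawn as one node and its count would be the number shown inside the node.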

The Analysis of Evolutionary Path of COVID-19
The changes of each site.
First, we counted the changes at each site. In total, 87 sites changed. For a single site, the number of genomes in which it changed ranges from 1 to 9. There are 62 sites that changed in only one genome. To keep Figure 1 readable, we retain only the sites that changed in two or more genome sequences, as shown in Figure 1. This means that some sites are conserved and do not change easily, while others are not.
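The per-site counting described above can be sketched as follows. This is a minimal illustration on toy data: the function name changes_per_site and the sequences are hypothetical, and sites are numbered from 1 by assumption.

```python
def changes_per_site(reference, genomes):
    """Map each changed site (1-based) to the number of genomes that differ there."""
    counts = {}
    for seq in genomes:
        for pos, (ref_base, base) in enumerate(zip(reference, seq), start=1):
            if base != ref_base:
                counts[pos] = counts.get(pos, 0) + 1
    return counts


# Toy sequences (hypothetical), not the 22 genomes of Table 1.
reference = "ACGTACGT"
genomes = ["ACGAACGT", "TCGAACGT", "ACGTACGA"]

counts = changes_per_site(reference, genomes)
# Sites that changed in only one genome, and sites kept for plotting (two or more genomes).
single_change_sites = sorted(pos for pos, n in counts.items() if n == 1)
plotted_sites = {pos: n for pos, n in counts.items() if n >= 2}
```

The final filter mirrors the readability rule for Figure 1: sites that changed in a single genome are counted but left out of the plot.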

The matrix of evolutionary path.
Second, we construct the matrix of the evolutionary path of COVID-19. The more sites change in one genome sequence, the larger the corresponding matrix element value, as shown in Figure 2. The numbers in Figure 2 correspond to the genome sequences in Table 1. Because we build the network in chronological order, the first genome sequence of COVID-19 (MN908947.3) is at the centre of the whole evolutionary network. The more sites of a genome sequence change, the farther it is from the first genome sequence. As shown in Figure 2, the number of genomes with site changes increases over time.
COVID-19 is an RNA virus and is therefore more likely to mutate. There are many reasons why the genetic material of a virus changes. Many chemical and physical factors can induce mutation, such as nitrous acid, hydroxylamine and high temperature. At present, however, the information available to us is limited, so we analyse the evolution only from the point of view of topology, in chronological order.
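The text does not spell out how edges of the chronological network are chosen, so the following is only one plausible reading, stated as an assumption: genomes are processed in order of sampling date, and each genome is linked to the earlier genome from which it differs at the fewest sites. The names hamming and chronological_edges, the linking rule and the toy records are all hypothetical.

```python
def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def chronological_edges(records):
    """records: (name, ISO date, sequence) tuples. Returns (parent, child, distance) edges.

    Assumed rule: each genome attaches to the closest genome sampled before it;
    ties go to the earliest-sampled candidate.
    """
    ordered = sorted(records, key=lambda r: r[1])  # ISO dates sort chronologically
    edges = []
    for i in range(1, len(ordered)):
        name, _, seq = ordered[i]
        parent = min(ordered[:i], key=lambda r: hamming(r[2], seq))
        edges.append((parent[0], name, hamming(parent[2], seq)))
    return edges


# Toy records (hypothetical names, dates and sequences).
records = [
    ("g1", "2020-01-10", "ACGTACGT"),
    ("g3", "2020-01-20", "TCGAACGT"),
    ("g2", "2020-01-15", "ACGAACGT"),
    ("g4", "2020-01-22", "ACGTACGA"),
]
edges = chronological_edges(records)
```

Under this rule the earliest genome has no parent and therefore acts as the centre of the network, which matches the role of the first genome sequence described above.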

Evolutionary Tree
Based on the available high-quality sequence variation data for COVID-19, evolutionary trees were constructed by UPGMA and by Bayesian evolutionary analysis.
In this paper, the evolutionary tree of COVID-19 was drawn with MEGA (Figure 3) and BEAST (Figure 4). In Figure 3, the tree scale is 0.0001, which represents the proportion of differences between sequences (e.g. 0.1 means a 10% difference between two sequences). In Figure 4, the tree scale is 0.05; multiplying the scale by 365 gives the number of days before the sampling time of the most recently sampled genome, so 0.05 corresponds to about 18 days.
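For readers unfamiliar with UPGMA, the sketch below shows the clustering rule on a small distance matrix: the two closest clusters are merged repeatedly, and distances to the merged cluster are size-weighted averages. It is a toy illustration, not the MEGA implementation used for Figure 3; the function name upgma, the tuple representation of the tree and the distance values are assumptions.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Leaves are names; internal nodes are (left, right, height) tuples,
    where height is half the distance between the two merged clusters.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])  # closest pair
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average keeps every original leaf equally weighted.
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Toy pairwise distances (hypothetical) between three genomes A, B and C.
names = ["A", "B", "C"]
matrix = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]
tree = upgma(names, matrix)
```

UPGMA assumes a constant rate of change along all branches, which is why every leaf ends at the same height in the resulting tree.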

Haplotype Analysis
A haplotype network of COVID-19 genomes was used to reveal the genetic distance and evolutionary relationships between different COVID-19 haplotypes.
Over time, more and more virus mutations appear and the haplotype network of the virus genomes becomes increasingly complex, eventually too complex to read by eye. Therefore, we display only the data from China up to 2020-1-27, corresponding to genome sequences 1 and 3 in Table 1.
Each circle (node) represents a haplotype. In Figure 5, nodes of different colours represent different provinces. The number in a node indicates the number of COVID-19 genomes belonging to that haplotype. A line represents the distance between two haplotypes: the longer the line, the greater the distance (the greater the difference). As of 2020-01-17, 57 strains of the virus had been sampled, with a total of 30 haplotypes. The first virus genome is one of the 18 genomes in the middle of Figure 5, and the third genome is the green one in Figure 5.
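The paper does not name the algorithm used to lay out the haplotype network, so the sketch below uses a minimum spanning tree over pairwise differences as a simple stand-in, built with Prim's algorithm. The function names hamming and minimum_spanning_edges and the toy haplotypes are hypothetical; edge weights play the role of line lengths in Figure 5.

```python
def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_edges(haplotypes):
    """Prim's algorithm over pairwise Hamming distances.

    Returns (i, j, distance) edges, where i and j index into haplotypes.
    """
    if not haplotypes:
        return []
    in_tree = {0}  # start from the first haplotype
    edges = []
    while len(in_tree) < len(haplotypes):
        # Cheapest edge joining a haplotype in the tree to one outside it.
        best = min(
            (
                (i, j, hamming(haplotypes[i], haplotypes[j]))
                for i in sorted(in_tree)
                for j in range(len(haplotypes))
                if j not in in_tree
            ),
            key=lambda edge: edge[2],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Toy haplotypes (hypothetical), one string per distinct haplotype.
haplotypes = ["ACGT", "ACGA", "TCGA", "ACTT"]
edges = minimum_spanning_edges(haplotypes)
total_differences = sum(d for _, _, d in edges)
```

Dedicated haplotype network methods can also add alternative links and inferred intermediate nodes, which a plain spanning tree does not show.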

Conclusions
For humans, COVID-19 is an unknown virus. Our understanding of it rests on the hard work of countless researchers. Just as the understanding of the disease caused by the virus is constantly being updated, so is the understanding of the virus itself. Even as we study the virus, it is mutating and evolving. In the hope of helping clinical treatment and vaccine development, in this paper we proposed a method of