Phylostratigraphic analysis shows the earliest origination of the stress-associated genes in A. thaliana

Phylostratigraphic analysis offers a way to view phylogenetic data from an evolutionary perspective. It estimates the evolutionary age of a gene by analyzing its orthologs and identifying the last common ancestor. We performed phylostratigraphic analysis of Arabidopsis thaliana genes associated with several types of abiotic stress (heat, cold, water-related, light, osmotic, salt, and oxidative), as determined by Gene Ontology annotation. Comparison of the age distributions of genes associated with different types of stress showed that heat stress involves older genes, while light stress involves younger genes. At the same time, all types of stress are characterized by a significantly higher proportion of old genes (common to all eukaryotes) than the whole set of A. thaliana genes. This can be explained by the involvement of basic molecular processes of plant cells in the stress response. Reconstruction and graphical analysis of the heat stress gene network revealed several clusters associated with different response functions. Some of these clusters contain only ancient genes. The results show that phylostratigraphic analysis reveals fundamental features of the organization of gene networks and their evolution.

other genes. For instance, the study [5] reported that human genes assigned to such phylostrata as Cellular organism and Eukaryota are generally associated with basal cellular functions (metabolic processes, transcription regulation), while genes that originated at later stages of evolution are associated with the immune response and reproduction. In plants, older genes are also associated primarily with fundamental cellular processes (photosynthesis, RNA transcription and processing, primary metabolism), and younger genes are associated with secondary metabolism, hormonal regulation, and transcription regulation [9]. Expression of the youngest genes of Arabidopsis thaliana showed a bias toward mature pollen and was enriched in a gene co-expression module that correlates with mature pollen [10].

Since the molecular basis of the phenotype of an organism is gene networks, the study of the Physcomitrella patens (moss) demonstrated that genes from the same evolutionary period tend to be connected, whereas old and young genes tend to be disconnected, and the modules of the same age emerged at a specific time in plant evolution [9].

Previously, we developed Orthoscape, a Cytoscape application for the analysis and visualization of gene ages in the context of the structure of their gene networks [11]. In the present study, Orthoscape was used to analyze genes associated with seven types of abiotic stress in A. thaliana. The stress response in plants is crucial for their adaptation to the environment and for their evolution. Genetic systems for responding to abiotic stresses have been studied in model plants such as rice and Arabidopsis [12,13]. These systems consist of coordinately functioning genes; they have a degree of evolutionary plasticity, and their composition may change significantly in the course of evolution owing to the large role of segmental and whole-genome duplications in plants [14]. Using the Orthoscape application, we carried out phylostratigraphic analysis of plant stress genes, including an assessment of the distribution of these genes by evolutionary age as well as the reconstruction and graphical visualization of gene networks, using the heat stress response network as an example. Our results show that the genes associated with different types of stress differ in age. In general, however, the response to stress involves a significant proportion of old genes. Graphical analysis of the reconstructed heat stress gene network demonstrated its modular organization, in which some modules consist of age-homogeneous genes. The study demonstrates that phylostratigraphic analysis yields new data on the evolution of stress response genes in plants.

In the first step, extended lists of GO terms associated with each type of stress were formed. To do this, we selected all the terms that contained the keyword "stress" in either the title or the description, as well as all their child terms. The initial list was then refined: GO terms not associated with the given type of stress were removed. As a result, we selected 161 terms that characterize particular types of stress. Subsequent analysis showed that the lists of terms associated with the keywords 'water' and 'drought' overlapped substantially: 25 terms were associated with the keyword 'water' and 10 with 'drought', and 6 terms were common to both. Therefore, these two lists were combined in our analysis under the name "water-related stress".

genes. Therefore, these taxa were excluded from further analysis. It should also be noted that the list of genes described in KEGG contains 32690 elements (http://rest.kegg.jp/list/ath). However, only those genes for which at least one annotation term was found in the Gene Ontology database were selected for the analysis. This list included 25843 genes and is used below as the background list of A. thaliana genes.

This reduction allows us to take into account the fact that younger genes are less well annotated in GO.
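The term selection and background filtering described above can be illustrated with a minimal sketch. It assumes the ontology is available as two plain dictionaries (term ID to name and definition, and term ID to direct child IDs); the function names, the toy term IDs and the toy gene names are hypothetical and are not taken from GO, KEGG or Orthoscape.

```python
from collections import deque


def select_terms(terms, children, keyword):
    """Return IDs of terms whose name or definition contains `keyword`,
    together with all of their descendant (child) terms."""
    keyword = keyword.lower()
    seeds = [
        term_id
        for term_id, (name, definition) in terms.items()
        if keyword in name.lower() or keyword in definition.lower()
    ]
    selected = set(seeds)
    queue = deque(seeds)
    # Breadth-first walk down the child relation.
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def background_genes(annotations):
    """Keep only genes that have at least one annotation term."""
    return {gene for gene, go_terms in annotations.items() if go_terms}


# Toy ontology with hypothetical IDs and texts (not real GO entries).
terms = {
    "T1": ("response to water deprivation", "A response to a drought stimulus."),
    "T2": ("cellular response to water deprivation", "Cell-level response."),
    "T3": ("response to heat", "A response to a temperature stimulus."),
    "T4": ("water transport", "Movement of water."),
}
children = {"T1": ["T2"]}

water = select_terms(terms, children, "water")
drought = select_terms(terms, children, "drought")
water_related = water | drought  # overlapping lists merged into one

# Toy gene annotations (hypothetical gene names).
annotations = {"geneA": {"T1"}, "geneB": set(), "geneC": {"T3", "T4"}}
background = background_genes(annotations)
```

If the two keyword lists are treated as sets, the reported counts (25 'water' terms, 10 'drought' terms, 6 shared) imply 25 + 10 - 6 = 29 distinct terms in the combined list before any further refinement.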

A list of genes associated with the different types of stress is given in Supplementary file 2. For each type of stress we identified at least 100 genes (minimum: 102 genes for heat stress; maximum: 232 genes for salt stress). Interestingly, there was no significant linear correlation between the number of GO terms and the number of genes associated with these terms (the Pearson correlation coefficient between these values was 0.09).
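For readers who wish to reproduce this kind of check, the Pearson coefficient between two equally long lists of counts can be computed as in the sketch below. The function name is hypothetical and the example values are toy numbers chosen only to exercise the formula; they are not the per-stress counts of GO terms and genes from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data only: a perfectly linear and a perfectly anti-linear relationship.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

A value close to zero, such as the 0.09 reported above, indicates that the number of GO terms in a list says little about the number of genes annotated with them.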

Identification of genes from the resulting lists in the KEGG database allowed us to find almost all genes corresponding to the TAIR annotations: for most gene lists, only 1-6 genes were not detected; only for the light stress list were 13 genes excluded.

other hand, a large fraction of gene sets have a number of genes in common with the salt stress dataset (5 out of 6 types share more than 10% of their genes with this type of stress). However, the majority of comparisons (28 out of 42) yield less than 10% of common genes. The fraction of unique genes is lower than 50% for only one type of stress, osmotic (30%); for three datasets it is greater than 70%, and for the other three it is greater than 50% (Table 2).

Therefore, we analyze the seven gene sets separately, bearing in mind that some pairs of gene sets overlap considerably.
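The pairwise comparison of gene sets can be sketched as follows. The count of 42 comparisons is consistent with ordered pairs of the seven stress types (7 x 6), so the sketch assumes that each comparison is directional and normalized by the size of the first set; this normalization, the function names and the toy gene identifiers are assumptions made for illustration, not the authors' code or data.

```python
from itertools import permutations


def shared_fraction(a, b):
    """Fraction of genes in set `a` that also occur in set `b`."""
    return len(a & b) / len(a) if a else 0.0


def pairwise_overlaps(gene_sets):
    """Directional overlap for every ordered pair of named gene sets."""
    return {
        (x, y): shared_fraction(gene_sets[x], gene_sets[y])
        for x, y in permutations(gene_sets, 2)
    }


def unique_fraction(name, gene_sets):
    """Fraction of genes in one set that occur in no other set."""
    own = gene_sets[name]
    others = set().union(*(g for n, g in gene_sets.items() if n != name))
    return len(own - others) / len(own) if own else 0.0


stress_types = ["heat", "cold", "water-related", "light",
                "osmotic", "salt", "oxidative"]
n_comparisons = len(list(permutations(stress_types, 2)))  # 7 * 6 ordered pairs

# Toy gene sets with hypothetical identifiers.
toy_sets = {
    "heat": {"g1", "g2", "g3", "g4"},
    "salt": {"g3", "g4", "g5"},
    "osmotic": {"g4", "g5"},
}
overlaps = pairwise_overlaps(toy_sets)
heat_unique = unique_fraction("heat", toy_sets)
osmotic_unique = unique_fraction("osmotic", toy_sets)
```

Because the measure is directional, a small set can share most of its genes with a large one while the reverse fraction stays low, which is why both orderings are reported separately.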

** Highest value among all stress datasets for specific identity threshold.

The opposite case is the gene set related to light stress. This set of genes has the largest average PAI values at 3 of the 5 thresholds used (Table 4). This is again in agreement with the data shown in Fig. 1

In the previous section, it was shown that stress gene lists contain both genes unique for each type of

The fourth cluster contains 13 genes, 10 of which are the only genes added to the initial heat stress gene set by STRING. Genes from this cluster have no connections to other clusters via STRING interactions. They are tightly interconnected within the cluster. All genes included in this cluster are proteasomal genes. It is likely that the function of this cluster is related to the degradation of proteins unfolded due to heat stress.

We combined the remaining four genes outside clusters 1-4 into the fifth cluster. It contains a pair of genes

shown that the involvement of some genes in several stress responses is one of the features of stress genes.

This was most noticeable for osmotic stress. More than 60% of the genes involved in responding to this stress are also involved in responding to other stresses (

some of them were identified as members of the heat stress response network (Fig. 3). One of these proteins,

These results are in good agreement with data from Ruprecht et al. [9], who showed that general biological processes, such as photosynthesis, glycolysis, DNA synthesis and others, were already present in the ancestors of green plants. In the study above, based on the analysis of rice and moss genes it was shown

to the Brassicaceae family. The authors showed that new genes are more likely than other genes to exhibit differential expression during the plant response to stress. These results, however, do not contradict ours. Although we have shown that stress response gene networks include a significant portion of ancient genes, some young genes are also involved in these networks. These young genes may be involved in regulatory modules of gene networks (Fig. 3), as well as in the system of sensitivity to stressors, and may therefore be among the first to change their expression in response to external factors.

We have shown that there are differences between the ages of genes involved in different types of stress. Thus, the genes of the response to heat stress contain the largest proportion of ancient representatives and have the lowest PAI values. This is most likely due to the involvement in this response of such ancient families as chaperones and proteasomal proteins, which represent a significant proportion of all proteins in the set (Fig. 3).

The response to light stress is characterized by the presence of younger genes; their average PAI values are the highest (Table 4). For this type of stress, a high proportion of genes belonging to such phylostrata as Tracheophyta (vascular plants), Magnoliophyta (flowering plants) and Brassicaceae is observed (Fig. 1). As for the first two phylostrata, we can suggest that the high proportion of light stress genes for them may be due to the fact that notably vascular plants are