Evolutionary divergence in TSS usage comparing Bos taurus and Bos indicus
Firstly, statistical analysis of the reproducibility of TSSs from CAGE within and across sub-species when different number of biological replicates (1 or 2), coverage of samples (total or half tags), and minimum tag count thresholds (5, 10, 15, 20 and 25) for distinct positions were investigated for TSS calling. For this purpose, fetal lung, adult liver and lung, respectively with low, medium and high sample coverage were randomly divided into two sub-samples and TSSs called in each. The TSS called in each half sample were then compared. Based on these results, using a fixed minimum tag count threshold irrespective of the number of replicates and sample coverage will result in inconsistency in number of TSSs obtained in each tissue and make the comparison between tissues difficult (Supplementary Table 1). To improve signal confidence while taking into account the effect of number of replicates and sample coverage, the minimum tag count threshold (\({t}_{i}\)) was set using the formula \({N}_{max}\times \frac{{TPM}_{i}}{{N}_{i}}\), where \({N}_{max}\) is the number of biological replicates for a tissue having the highest replicates among the all tissues, e.g here \({N}_{max}=3\), \({TPM}_{i}\)is total tags per million CAGE tag counts in sample i, \({N}_{i}\) is the number of biological replicates for sample i. Also, for samples with high coverage such as kidney and lung (\({t}_{i}\)> 25), to avoid missing a noticeable number of true TSSs, the \({t}_{i}\) was set to 25 tag counts. A minimum reproducibility of 76%, measured as the fraction of genes with the same number of TSSs in the total and half samples, was obtained when the coverage was reduced to less than one million CAGE tag counts for fetal lung (Supplementary Table 2). This could be a result of the reduction of \({t}_{i}\) to three tags, and therefore it resulted in reducing the signal confidence in fetal lung sub-samples. In total, based on these results, the reproducibility of TSS is principally a consequence of the depth of sampling in individual replicates. Considering the relatively good sample coverage in the current study, the average reproducibility across samples was higher than 80%.
To study TSS evolution, we compared TSS positions and the distribution of the cluster of tags within TSSs across three tissues, including liver, muscle, and spleen in adult Bos taurus and Bos indicus. The number of TSSs in each tissue is shown in Table 1. The genes were divided into four groups based on the differences in the number of TSSs: 1) genes with single TSSs in both sub-species, 2) genes with multiple TSSs in both sub-species, 3) genes with divergent single/multiple TSSs across sub-species, i.e. a single TSS in taurus and multiple TSSs in indicus or vice versa, and 4) extreme cases in which a single or multiple TSSs were observed in one sub-species but no TSS were observed in the other. In total, 20% (liver), 43% (muscle), and 30% (spleen) of all genes were in the first group, while only a minority of genes (3–9%) belonged to the second group. Interestingly, about 12% (spleen and muscle) and 10% (liver) of the total genes had divergent single/multiple TSSs. The fourth group of extreme cases in which single or multiple TSSs were observed in one species but no TSS were observed in the other ranged between 77% and 48% of genes in liver and muscle, respectively.
Table 1
Summary of the number of TSSs for each sample along with the biological replicate, number of CAGE tags and number of genes with single or multiple TSSs
Tissue
|
Biological replicate
|
CAGE tags
|
TSSs
|
Genes with TSSs
|
Total
|
In known genes
|
In promoter, proximal, and 5’ UTR of known genes
|
Total
|
Single TSS
|
Multiple TSSs
|
Bos indicus
|
Adult stage
|
Blood
|
1
|
2688190
|
5974
|
5733
|
5389
|
4352
|
3549
|
803
|
Kidney
|
1
|
13530755
|
14190
|
13469
|
12186
|
8449
|
5774
|
2675
|
Muscle
|
1
|
4873320
|
7234
|
6916
|
6310
|
4843
|
3770
|
1073
|
Ovary
|
1
|
6298852
|
9680
|
9377
|
8595
|
6444
|
4854
|
1590
|
Spleen
|
1
2
|
3693393
|
8211
|
7994
|
7622
|
5814
|
4418
|
1396
|
4077550
|
Thyroid
|
1
2
|
11418581
|
8503
|
8283
|
7930
|
6030
|
4541
|
1489
|
4077550
|
Uterus
|
1
|
7478607
|
7125
|
6872
|
5956
|
4660
|
3717
|
943
|
Liver
|
1
|
5897856
|
11928
|
11421
|
10694
|
7804
|
5654
|
2150
|
Lung
|
1
|
24258474
|
15151
|
14586
|
12279
|
8611
|
6084
|
2527
|
Fetal stage
|
Liver
|
1
|
7981054
|
7498
|
7164
|
6641
|
5293
|
4204
|
1089
|
Lung
|
1
|
1647510
|
11562
|
11199
|
10470
|
7640
|
5549
|
2091
|
Bos taurus
|
Adult stage
|
Heart
|
1
2
|
11874030
|
3355
|
3194
|
2958
|
2509
|
2150
|
359
|
11795755
|
Liver
|
1
2
|
6267975
|
5046
|
4864
|
4502
|
3704
|
3065
|
639
|
6070584
|
Mammary
|
1
2
3
|
21558896
|
2495
|
2359
|
2199
|
1949
|
1741
|
208
|
17327550
|
8235522
|
Muscle
|
1
2
|
3967633
|
6784
|
6523
|
5996
|
4729
|
3754
|
975
|
4181101
|
Spleen
|
1
2
|
7876366
|
5555
|
5331
|
5063
|
4275
|
3618
|
657
|
9644352
|
Fetal stage
|
Liver
|
1
2
|
10777732
|
3595
|
3427
|
3153
|
2704
|
2330
|
374
|
9998986
|
Further analysis of genes with single TSSs in both sub-species confirmed that TSS positions on the genome were not fixed, and TSSs mostly (~ 75–92%) lay within < ± 10 nucleotides of the other (Fig. 1). However, about ~ 50% of TSS in genes with divergent TSSs across sub-species had a TSS within < ± 20–25 nucleotides in the other sub-species. A noticeable number of these genes (~ 30–40%) had TSSs that had been translocated by > ± 25 and < ± 50 nucleotides from the TSS in the other sub-species. This is consistent with previous evolutionary research between human and mouse that suggested that the signals encoding TSSs are highly flexible and evolvable25. In line with evolutionary research from human and mouse23, 26 suggesting two types of evolutionary pathways lead to TSS turnover, we observed sliding of the TSSs along the genome and gradual shifting of usage from one TSS to an alternative TSS in the other sub-species. It seems that degradation of a weak TSS or an emerging new one over evolutionary time was likely responsible for observing the variability in the number of TSSs for a specific gene across the sub-species (Fig. 2, bar charts).
A previous study in humans and mice8 classified TSSs into four “shape” categories using four different criteria, which were applied in a specific order. 1) A TSS was assigned as a single peak (SP) shape if the distance between the 75 and 25 tag density percentile within a TSS (i.e. interquartile range (IQR)) was less than 4 bp; 2) If the ratio between the first and second tag peak was > 2 and the TSS was not classed as SP, it was classified as a broad with dominant peak (PB); 3) If the TSS was not classified as SP or PB and there was one or more consecutive tag density 5th percentile pairs with a distance exceeding 10 bp a TSS was classified as bi- or multimodal (MU); 4) If none of the above applied, the TSS was classified as broad (BR). That study8 used a genomic interval, the ratio between the first and second tag peak, and a distance between the tags for the classifying TSS, but in the current study to better visualize the probability distribution of the cluster of tags within a TSS we classified TSS into four shape categories based on the genomic interval covered by the cluster of tags and the probability of the tags distribution within a single TSS. In the single dominant peak class (SP), the tags were concentrated to no more than four consecutive start positions. The clusters spanning a broader region (IQR > 4 bp) were further divided into three categories based on the shapiro test for normality and Hartigans' Dip test for unimodality in R packages: 1) If the P-value for shapiro test > 0.05 a TSS was classified as normal broad distribution (NB); 2) If the P-value for the Dip test was < 0.05 a TSS was classified as a bi- or multimodal broad distribution (MB); 3) If none of the above applied, the TSS was classed as a general broad distribution (BR). Generally, in both sub-species, the first and second highest fraction of TSSs were observed in SP and BR shape groups, and taurus in comparison with indicus tended to have higher frequency of SP TSSs except for genes with divergent TSS in spleen and muscle in which taurus and indicus had single and multiple TSSs, respectively (Fig. 3). Furthermore to visualize the TSS width, we plotted the frequency of TSSs based on the genomic interval covered by the cluster of tags within the TSS in genes with divergent and multiple TSSs across sub-species separately for each tissue (see Fig. 2 for histogram charts). Based on these results, TSSs in indicus tended to be wider, spread over a larger genomic region, in genes with divergent TSSs in which indicus and taurus sub-species had single and multiple TSSs, respectively (Fig. 2 left panel).
A crossover TSS event was identified if the dominant TSS, the TSS with the highest expression, switched between the two sub-species for a specific gene. To assess crossover switching events, the expression of TSSs was compared for genes having multiple TSSs in both sub-species. Our hypothesis is that differential TSS usage in different sub-species can result in significant regulatory changes. In total 56 (0.64%), 78 (1.14%), and 97 (1.67%) genes had crossover TSS switching events between taurus and indicus in liver, spleen, and muscle, respectively (Supplementary Table 3). Significantly enriched biological process (BP) and gene ontology (GO) terms for these crossover switching genes in the three tissues included catabolic and metabolic process in muscle and liver and positive regulation of T cell activation and phospholipid binding in spleen (P-value < 0.05). Based on these results, one reason for the noticeable heterosis which is observed in Bos taurus × Bos indicus crossbreds could be that there are more functional TSSs in crossbreds compared to the pure species. This mechanism for heterosis would have to confirmed by demonstrating a dominance effect on fitness.
Characterization of TSSs architecture in heat shock protein-related genes
Bos taurus breeds are best suited to sub-tropical and temperate regions. They have thicker coats that allow them to endure cooler winters, and they do not have the notable ‘hump’ of their Bos indicus relatives. Bos indicus cattle in contrast have large ears and dewlap, which help to keep them cool and are well-suited to tropical environments. With the hypothesis that heat shock proteins may be involved in this adaptation, ten heat shock protein-related genes such as HSP family genes, including HSP32, HSP60, HSP70, HSP90, and HSP105 (Supplementary Table 4) and two apoptotic-related genes (BAX, and BCL2) were analysed for variation in TSS expression and number, and the distribution of the tag clusters within TSS in adult tissues between and within the sub-species. In indicus, except for BAX, BCL2, HSPA14, and HSPH1 which had single or multiple TSSs across some of the tissues and no TSS in the others, the remaining genes were expressed in all tissues. However, in taurus most of the genes had no TSS across all or some of the tissues, except for HSPA5, HSP90B1, and HSP90AA1 (Supplementary Table 5). Generally, more variation was seen in the number of TSSs in indicus compared to taurus sub-species (t-Test; P-value < 0.05). A previous study 28 profiling the genome-wide TSSs in lungs of C57BL/6 mice 24 h after intratracheal instillation of the multiwalled carbon nanotube Mitsui-7, which is associated with increased stress and inflammatory response, revealed that many of the Mitsui-7-responsive TSSs were alternative TSSs for known genes, and the number of alternative TSSs used for a given gene was positively correlated with overall Mitsui-7 response. These researchers suggested that individual TSSs may simply be additive, or may have expression output limitations that make transcription of additional TSSs favourable, also they suggested that having multiple TSSs could enable a faster response to stimuli28. Similarly, HSP genes have been extensively reported to respond to external stress stimuli29, so observing more variation in the number of TSSs in indicus sub-species may indicate their higher ability to respond to stress. Future research exploring a time series response could shed light on the latter hypothesis.
All the expressed HSP genes had a higher total TSS expression level in indicus compared to taurus sub-species (t-Test; P-value < 0.05). The highest TSS expression was observed for gene HSP90B1 in both sub-species (Supplementary Table 6). A heat map of the transcript expression for the most common TSS (core TSS) in genes related to heat stress across the indicus adult tissues (Fig. 4A) and analysis of the core TSS shape (fraction of broad or sharp TSS for each gene across tissues; Fig. 4B) demonstrated association between TSS expression and peak shape in different genes. Core TSS, the most common TSS among all tissues, in the three highest expressed genes (HSP90B1, HSP90AA1, and HMOX1) were observed to have sharp, well defined TSS peaks (Fig. 4).
Variation in TSS number between developmental stages within each sub-species
Like any specific tissue in the body, the biological features of tissue in fetal and adult stages might be determined mainly at the level of gene expression and transcription. So differential and quantitative analysis of TSS expression and distribution patterns could be useful for the identification of developmental-stage-specific genes and facilitate our understanding of the mechanisms of coexistence of a gene’s transcription pattern in fetal and adult stages. TSSs were detected in adult and fetus of two indicus and one taurus tissues, with 7,498 (11,928) and 11,199 (15,151) detected in fetal (adult) liver and lung indicus tissues and 3,427(4,864) in fetal (adult) taurus liver, respectively (Table 1). In the indicus and taurus liver, the proportion of TSSs that matched known genes was the same (~ 96%), and about 93% (94%) and 92% (93%) of those were located in the promoter, proximal region, or 5’ UTR of genes at the fetal (adult) stage in indicus and taurus liver, respectively. In the Bos indicus lung, 97% (96%) matched known genes, and 93% (84%) were located in the promoter, proximal region, or 5’ UTR at the fetal (adult) stage. In the pilot Encyclopaedia of DNA Elements study, it was shown that there are long transcripts that can bridge genes or even span several genes, often starting in the middle of a gene structure30. A small proportion of TSSs observed within introns could be the result of recapping due to post-transcriptional modifications30. In the current study, only TSS located in the promoter, proximal, and 5’ UTR regions of known genes were used for all analyses.
Developmental-stage TSS changes in Bos taurus and indicus liver
In both sub-species a higher number of genes had a single TSS in the fetal stage compared to the adult (Fisher’s exact test, P-value < 0.0002). In total, 79% (72%) and 86% (83%) of genes in the fetal (adult) indicus and taurus liver had a single TSS, respectively (Fig. 5A, and B). This may be due to the use of less complex expression functions in the fetal stage compared to the adult stage. Among the genes with a single TSS in both fetal and adult stages metabolic process, cellular process, gene expression, and cellular biosynthetic process were ranked highest based on the gene ratio (i.e. proportion of genes associated with the BP GO term divided by the number of selected genes) in both sub-species (Bonferroni-corrected P-value < 0.05, Supplementary Table 7). Also, nuclear division, mitotic cell cycle, mitosis, and organelle fission were the common BP GO terms for genes with single TSS expressed only in indicus and taurus fetal liver (Bonferroni-corrected P-value < 0.05, Supplementary Table 8). This is in line with the specific features and functions of the fetal tissues, i.e cell proliferation and differentiation.
In the current study, about 8.7% and 12.5% of the total 4,250 and 9,215 expressed genes showed divergent single/multiple TSSs usage with aging in taurus and indicus liver, respectively. These results are in line with a previous study in human and indicate that TSS switching events are common and can play a significant role in development11. About 50–64% of genes with a single TSS expressed in fetal liver, showed an increase in the number of TSS with aging from single to multiple. One example of a gene in this category is ATP6V0E1, ATPase H + transporting V0 subunit e1. It mainly has function in hydrolase activity and proton transmembrane transporter activity. It is located on chromosome 20 in the region between 4732524 and 4762537 bp. Two TSSs were found in taurus and indicus adult liver at positions 4732518 and 4732488, however only position 4732518 was observed in taurus and indicus fetal liver. Considering that TSS expression at position 4732518 doubled from fetal to adult stage, it seems that the additional TSS 30 bp apart from the main TSS in adult tissue could have increased the transcription of the main TSS. Gene ontology analysis of genes with divergent TSSs across fetal and adult stages is shown in Supplementary Table 7. In the mammalian genome, the alternative usage of multiple TSSs is an important mechanism to increase mRNA and protein diversity from a single gene31. When comparing the number of TSSs between developmental stages in the liver, there were a number of cases (3-4.6%, Fig. 5A and B) where the number of TSS went from multiple to single TSS with aging. Genes in this category were involved in cellular and primary metabolic process, and cellular process (only significant in indicus liver at Bonferroni-corrected P-value < 0.05, see Supplementary Table 7).
Furthermore, 34 and 94 genes of the total 4,250 and 9,215 expressed genes had crossover TSS switching events in taurus and indicus liver with aging, respectively, which suggests variation in the dominant TSS over time (Supplementary Table 9). Since the alternative mRNA isoforms could be translated into functionally different products, a crossover switching event may suggest that one gene can play different roles at different time points in development. The most significant BP terms related to this group was catabolic process (P-value < 0.05).
Developmental-stage TSS changes in Bos indicus lung
Lung tissue plays an important role in the respiratory system of mammals after birth. Before birth, the lung is full of liquid32–34 and does not participate in gas-exchange because of high pulmonary vascular resistance and immature respiratory function35. Therefore, it is necessary for the lung to be sufficiently developed at birth to perform the function of gas-exchange, which requires numerous physiological changes to occur36. We performed quantitative and expression analysis of CAGE-Seq data in fetal and adult lung tissues of a single Bos indicus cow-fetus pair. A total of 9,784 genes were expressed across both stages. Similar to liver tissues, a higher number of genes had a single TSS in the fetal stage compared to the adult (Fisher’s exact test, P-value < 0.005; Fig. 5C). In total, 73% (70%) of genes at the fetal (adult) stage had a single TSS in the lung. Similar to fetal liver tissues, genes in fetal lung with a single TSS were mainly involved in mitosis, nuclear division, and cellular process (Bonferroni-corrected P-value < 0.05, see Supplementary Table 8). Immune response and response to stimulus were the common significant BP GO terms for genes with a single or multiple TSSs expressed only in adult lung, respectively (Bonferroni-corrected P-value < 0.05, see Supplementary Table 8).
Similar to liver tissue, three patterns of TSS numbers per gene were seen in lung tissue between developmental stages (Fig. 5C). The first pattern was genes with multiple TSSs in the fetal stage, but just one TSS in the adult stage (e.g., HSP90AA1); the 2nd pattern was relatively constant numbers at both ages either single (e.g., SFTPC) or multiple TSSs (e.g., ACTB); and the 3rd pattern (e.g., HSP14) had one TSS in the fetal, but multiple TSSs in adult tissue. Functional annotation of genes with divergent TSSs across fetal and adult stages is shown in Supplementary Table 7 (Bonferroni-corrected P-value < 0.05). Interestingly, the first pattern was observed in all the assessed heat shock protein related genes with exception of HSPA9, HSPA13, HSPD1, and HSPH1 which reflected the 2nd pattern with aging, and for HSP14 in which the third pattern was observed (Supplementary Table 10).
Crossover switching of TSSs occurred in 221 genes between the two stages (Supplementary Table 9) and were enriched for the BP GO terms biological regulation, protein localization, and regulation of transcription (P-value < 0.05). In line with these results, it has been reported that the usage of alternative TSSs in some genes expressed in the cerebellum could play a regulatory role, such as temporally regulated expression, amplitude of expression, mRNA stability, and mRNA translational efficiency37. One example of crossover TSS switching with aging in lung tissue is BLVRB, biliverdin reductase B. We found two common TSSs, which were 31 bp apart for this gene. The dominant TSS shifted with aging and the highest total transcript gene expression was seen at the fetal stage. The crossover TSS switching in this gene may play a regulatory role, such as temporally upregulated expression in fetal lung.
Alternative TSSs usage across Bos indicus adult tissues
The number of TSSs in seven tissues from a single Bos indicus adult female is shown in Table 1. In total, the number of TSSs identified ranged between 5,974 (blood) and 15,151 (lung). About 95–97% of TSSs identified in different tissues were located in known gene transcripts. In general, 4% (thyroid) to 16% (lung) of total TSS peaks identified in known gene transcripts overlapped with coding sequence (CDS), intron, exon, 3’ UTR, and antisense (i.e. genes on the opposite strand) regions. Although there is some evidence that many novel core promoters especially for novel non-coding RNA are located in intergenic regions30, for simplicity the final dataset was restricted to the promoters-proximal, and 5’ UTR regions.
In total, 53.3% of genes expressed in all tissues had a single TSS, while only a minority (6.6%) of genes had multiple TSSs. Interestingly, a noticeable proportion of the genes (40.1%) had divergent TSS numbers, i.e. they had only one TSS in one or some tissues while having multiple TSSs in other tissues. In about 68% of genes with divergent TSSs the number of TSSs per gene in single tissues having multiple TSSs was two, 24.3% had three or more TSSs, and 7.6% had two or more TSSs across tissues. Table 1 describes in detail the number of genes with single or multiple TSSs in different tissues. This highlights that about half of the genes in cattle have multiple promoters or alternative TSSs per promoter and therefore high potential for differential transcriptional regulation across tissues. Several genome wide analyses have shown that about half of human and mouse genes have multiple promoters38, 39. Our results were in agreement with this research and indicate tissue specific usage of TSS in mammals4, 40.
Tissues are distinguished by gene expression patterns, indicating distinct regulatory processes. Individual genes, or even sets of genes, in each tissue cannot adequately capture the diversity of structure and function that exist among different tissues41, and multiple regulatory elements, including transcription factors and TSSs, that work together with other genetic and environmental factors must control the transcription of genes and production of proteins42. Alternative TSSs can significantly alter the 5’ UTR structure and therefore result in higher or lower rate of protein synthesis 43, 44. When tissues were clustered based on their correlation between TSS numbers in different genes across tissues, tissues grouped mainly together into clusters reflecting their function, liver clustered with thyroid and spleen tissues, ovary with uterus and lung (Fig. 6). The lowest proportions of single TSS genes (72%) was observed in more complex tissues such as liver tissue (Fig. 7).
A TSS survey from the FANTOM consortium which included 573 human primary cell samples, 152 human tissues, and 250 human cancer cell lines, has demonstrated that on average four TSSs exist per gene 9. In the current study, the maximum number of TSS peaks per gene ranged from 7 (thyroid; gene CCND3, UBE2D3) to 19 (blood; gene LOC112444653). LOC112444653 is a 5.8S ribosomal RNA located on chromosome 27. The minimum and maximum number of TSSs for this gene was observed in liver (single TSS), and blood (19 TSSs peaks), respectively. Gene SEPT9 is another example of a complex gene with alternative splicing resulting in six confirmed protein isoforms. It is a member of the septin family involved in cytokinesis and cell cycle control and both its expression levels and isoform composition differ among cell types 45. The lowest and highest number of TSSs for this gene was observed in ovary and muscle (single TSS), and blood (6 TSS), respectively (Fig. 8). Although the septin family of genes has diverse functions in cytokinesis46, a growing body of research has confirmed their biological roles in several cellular processes, development and physiology of specific tissues and organs, and various pathophysiological states45. Both genes are perfect examples of genes where they are regulated in a tissue-dependent manner and consistent with previous research9.
When we looked at the genes with different numbers of TSSs across tissues, the most significant BP GO terms were cytoskeleton organization, actin filament-based process, actin cytoskeleton organization, cellular macromolecule catabolism (Bonferroni-corrected P-value < 0.05, Fig. 9). By assessing the terms associated with the genes that had only one TSS (one promoter) in all tissues, we found that single TSS genes tended to exist among the housekeeping genes with BP functions such as RNA processing/splicing, translation and protein catabolic process, protein localization, protein catabolic process (Bonferroni-corrected P-value < 0.05). In agreement with these results, a previous study 47 indicated that genes with single TSS were enriched in encoding proteins with housekeeping functions, while genes with multiple TSS were enriched in regulatory pathways.
In conclusion, our results give some insight into how TSSs and frequent local insertions, deletions and duplications in the regions containing them can drive rapid evolution of species and sub-species. For example, duplication of TSSs can allow for neo-functionalization of genes, where an original gene takes on tissue specific functions following the duplication event. This is similar to the neo-functionalization following whole genome duplication model proposed by previous study48, but on a gene scale rather than a genome scale. However the TSSs duplication process will obviously be much more frequent, as demonstrated here, than whole genome duplication events.