Multi-Tissue Transcriptomes Yield Information on High-Altitude Adaptation and Sex-Determination in Scutiger cf. sikimmensis

The Himalayas are one of earth’s hotspots of biodiversity. Among its many cryptic and undiscovered organisms, including vertebrates, this complex high-mountain ecosystem is expected to harbour many species with adaptations to life in high altitudes. However, modern evolutionary genomic studies in Himalayan vertebrates are still at the beginning. Moreover, in organisms, like most amphibians with relatively high DNA content, whole genome sequencing remains bioinformatically challenging and no complete nuclear genomes are available for Himalayan amphibians. Here, we present the first well-annotated multi-tissue transcriptome of a Greater Himalayan species, the lazy toad Scutiger cf. sikimmensis (Anura: Megophryidae). Applying Illumina NextSeq 500 RNAseq to six tissues, we obtained 41.32 Gb of sequences, assembled to ~111,000 unigenes, translating into 54362 known genes as annotated in seven functional databases. We tested 19 genes, known to play roles in anuran and reptile adaptation to high elevations, and potentially detected diversifying selection for two (TGS1, SENP5) in Scutiger. Of a list of 37 genes, we also identify 27 candidate genes for sex determination or sexual development, all of which providing the first such data for this non-model megophryid species. These transcriptomes will serve as a valuable resource for further studies on amphibian evolution in the Greater Himalaya as a biodiversity hotspot.


Introduction
The Himalayas are a distinct biogeographic eco-region with high biodiversity and endemism due to great topographic and climatic variation [1] and isolation. The uplift of Tibet and the Himalayas since about 45 million years ago (Mya), with the Greater Himalayas starting to rise presumably the earliest in the post-Eocene, or even more recently (~20-10 Mya; for a review see the supplementary in Hofmann et al. [2]) appears to have resulted in a unique assemblage of species evolved under gradual high-altitude adaptation, as already shown for several Tibetan reptiles [3] and anurans [4,5] as well as interspecies comparisons with evidence for parallel evolution [6][7][8]. However, with respect to high-altitude adaptation in the Greater Himalayas, there is still a considerable knowledge deficiency as there is a general lack of knowledge about the biodiversity of this high-mountain range, causing its relatively large number of cryptic and undiscovered species [9], even among vertebrates. Despite its poorly accessible landscapes, the fragile and vulnerable Himalayan mountain ecosystems have not been spared from land use changes, anthropogenic habitat degradation, and other increasing pressures

RNA Extraction, cDNA Library Construction, and Illumina Sequencing
Total RNA was extracted from six different tissues (brain, heart, liver, lung, kidney, testes) and adjusted to equal concentrations. RNA integrity was assessed by RNA concentration, RIN value, 28S/18S and the fragment length distribution using an Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., CA, USA). Complementary DNA (cDNA) was synthesized and sequenced by BGI (BGI-Hongkong Co., Ltd.), using the Illumina NextSeq 500 sequencing system (Illumina, San Diego, USA).

Filtering, De Novo Assembly, Functional Annotation and Gene Expression
The raw Illumina reads were filtered by quality and for adaptor contamination, targeting reads containing adaptor, reads with more than 5% ambiguous bases, or reads with greater than 20% of bases with quality score below 15. The remaining high-quality, clean reads were de novo assembled to transcripts using Trinity v2.0.6 [32], based on the de Bruijn graph algorithm. Transcripts were clustered using Tgicl v2.0.6 [33] to obtain sequences that could no longer be extended. The resulting sequences were defined as unigenes. They represent expressed sequences, but are not characterized sufficiently to be denoted as a complete gene. The entire unigene set was functionally annotated by matching against seven frequently used databases, including Gene Orthology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), EuKaryotic Orthologous Groups (KOG), the nonredundant nucleotide (NT) database, the non-redundant (NR) protein database, SwissProt, and InterPro. We used Diamond v0.8.31 [34] or the BLASTx [35] algorithm with an E-value threshold of 1.0×10 −5 to align unigenes to KEGG, KOG, NR, and NT and SwissProt for the annotation. Unigenes that matched to the NR database were annotated to the GO database with Blast2GOv2.5.0 [36] and classified into three categories: biological process, cellular components and molecular functions. BLAST2GO was also used to identify protein domains against the InterPro databases using the InterProScan5 v5.11-51.0 tool [37]. Candidate coding areas of these unigenes were predicted using TransDecoder v.3.0.1 (https://github.com/TransDecoder/TransDecoder) and the Pfam protein homologous sequences were searched by Blast to SwissProt and hmmscan v3.0 [38] to predict the coding region. For a coding sequence with more than two ORF, the longest one was identified as the sequence of interest. We mapped all unigenes to AnimalTFDB2.0 database and EMBOSS getorf v6.5.70 [39] to find ORF of each unigene and aligned the ORF to Transcription Factor domains using hmmsearch.
For each sample, expression profiles were obtained by mapping clean paired-end reads back to unigenes using Bowtie2 v2.2.5 [40] with default parameters. Raw expression counts were then calculated with RSEM v1.1.20 [41]. The raw counts were normalized into FPKM values to quantify transcript levels among the different tissue samples.
Data generated in this study are publicly available from the NCBI GenBank database under the Bioproject ID PRJNA532534. All clean sequence data were deposited in the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra/) under the accession numbers SRR9953599-SRR9953604, assembled sequences were transmitted to NCBI Transcriptome Shotgun Assembly

RNA Extraction, cDNA Library Construction, and Illumina Sequencing
Total RNA was extracted from six different tissues (brain, heart, liver, lung, kidney, testes) and adjusted to equal concentrations. RNA integrity was assessed by RNA concentration, RIN value, 28S/18S and the fragment length distribution using an Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., CA, USA). Complementary DNA (cDNA) was synthesized and sequenced by BGI (BGI-Hongkong Co., Ltd.), using the Illumina NextSeq 500 sequencing system (Illumina, San Diego, USA).

Filtering, De Novo Assembly, Functional Annotation and Gene Expression
The raw Illumina reads were filtered by quality and for adaptor contamination, targeting reads containing adaptor, reads with more than 5% ambiguous bases, or reads with greater than 20% of bases with quality score below 15. The remaining high-quality, clean reads were de novo assembled to transcripts using Trinity v2.0.6 [32], based on the de Bruijn graph algorithm. Transcripts were clustered using Tgicl v2.0.6 [33] to obtain sequences that could no longer be extended. The resulting sequences were defined as unigenes. They represent expressed sequences, but are not characterized sufficiently to be denoted as a complete gene. The entire unigene set was functionally annotated by matching against seven frequently used databases, including Gene Orthology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), EuKaryotic Orthologous Groups (KOG), the non-redundant nucleotide (NT) database, the non-redundant (NR) protein database, SwissProt, and InterPro. We used Diamond v0.8.31 [34] or the BLASTx [35] algorithm with an E-value threshold of 1.0 × 10 −5 to align unigenes to KEGG, KOG, NR, and NT and SwissProt for the annotation. Unigenes that matched to the NR database were annotated to the GO database with Blast2GOv2.5.0 [36] and classified into three categories: biological process, cellular components and molecular functions. BLAST2GO was also used to identify protein domains against the InterPro databases using the InterProScan5 v5.11-51.0 tool [37]. Candidate coding areas of these unigenes were predicted using TransDecoder v.3.0.1 (https://github.com/TransDecoder/TransDecoder) and the Pfam protein homologous sequences were searched by Blast to SwissProt and hmmscan v3.0 [38] to predict the coding region. For a coding sequence with more than two ORF, the longest one was identified as the sequence of interest. We mapped all unigenes to AnimalTFDB2.0 database and EMBOSS getorf v6.5.70 [39] to find ORF of each unigene and aligned the ORF to Transcription Factor domains using hmmsearch.
For each sample, expression profiles were obtained by mapping clean paired-end reads back to unigenes using Bowtie2 v2.2.5 [40] with default parameters. Raw expression counts were then calculated with RSEM v1.1.20 [41]. The raw counts were normalized into FPKM values to quantify transcript levels among the different tissue samples.
Data generated in this study are publicly available from the NCBI GenBank database under the Bioproject ID PRJNA532534. All clean sequence data were deposited in the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra/) under the accession numbers SRR9953599-SRR9953604, assembled sequences were transmitted to NCBI Transcriptome Shotgun Assembly Sequence Database (TSA, http://www.ncbi.nlm.nih.gov/genbank/tsa); the annotation dataset and statistics have been uploaded to figshare (doi 10.6084/m9.figshare.9913202).

Tests for Episodic Diversification in Selected Genes, Relevant for High Altitude Adaptation
We used the 21 most prominent genes, exhibiting convergent and continuous genetic adaptation to high elevations in Ranidae and Agamidae as provided by the authors of [6] upon request for assembled transcripts and annotated coding regions; thus, no whole-transcriptome selection analysis was feasible. The unigene annotation was screened for matches to the gene symbols of altitude adapted genes. Matching unigenes were filtered from the assembled transcriptome sequences. These candidate sequences were aligned with NR protein database to manually confirm the gene assignment. Coding sequence (CDS) of the transcripts was determined with TransDecoder. CDS were aligned codon based with the sequences from Sun et al. [6] using TranslatorX [42]. Finally, we removed stop codons and all codons that contained unaligned positions in at least one species. The alignments were submitted to aBSREL analysis at www.datamonkey.org [43,44], and therein were analyzed for potential diversifying selection at all branches, comparing selection through uncorrected p-values in the Nanorana-clade to those in Scutiger.

Identification of Genes Involved in Sex Determination or Sex Differentiation
Using a list of anuran genes presumably involved in male or female sex determination and sexual differentiation [45] to identify templates, we recovered such genes from the multiple tissues in Scutiger and studied their tissue-specific expression. Template protein sequences from the anuran model species Xenopus tropicalis or Xenopus laevis were obtained from Xenbase [46] and aligned with BLAT [47] (−t = dnax -q = prot) against Scutiger transcripts assembled from RNAseq data. Matching transcripts were identified and aligned to the NR database with BLASTx [35,48]. After manual inspection, we identified and removed paralogs or unspecific sequences. Unigenes were considered homologous, if the majority of the top scoring BLASTx hits were close to the gene of interest.

Transcriptome
From separate cDNA libraries of six tissues (brain, heart, kidney, liver, lung, testis), a total of 425.4 million raw sequence data (41.32 Gb) were generated. After filtering and quality checks, 413.4 million clean reads were obtained, with an average of 68.9 million reads per tissue ( Table 1). The GC-content of these clean reads ranged between 44.78% and 46.19%; 97% of all bases had a quality score of at least Q20. Over all samples, the assembly resulted in 110,889 non-redundant unigene sequences with a total length of 104,314,420 bp. Unigenes had an average length of 940 bp and a N50 of 1926 bp. A total of 103,396 unigenes ranged in length from 300 to 3000 bp (Table 2, Figure 2 and Supplementary Figure S1).  N50 length = weighted median statistic that 50% of the total length is contained in unigenes that are equal to or larger than this value.
N50 length = weighted median statistic that 50% of the total length is contained in unigenes that are equal to or larger than this value.

Functional Annotation
Altogether, 54362 (49.02%) unigenes showed homology to entries in at least one of the seven functional databases (Figure 3, Supplementary Table S1); 11,933 unigenes were annotated by all these databases (Supplementary File S1). Most matches were obtained in the NR database, with 48805 unigenes (44.01%), of which 26,071 (23.51%) could be assigned to gene ontology terms. Regarding the biological process ontology, the most common classifications were cellular processes (13,065), metabolic processes (9499) and biological regulation (6523). The most common categories for the cellular component ontology were cellular (13,125), cell compartments (13,004) and organelles (9276). In terms of the molecular function, most common categories involved binding (13,958), catalytic activity (9871) and molecular transporter activity (1479) (Supplementary Figure S2).

Functional Annotation
Altogether, 54362 (49.02%) unigenes showed homology to entries in at least one of the seven functional databases (Figure 3, Supplementary Table S1); 11933 unigenes were annotated by all these databases (Supplementary File S1). Most matches were obtained in the NR database, with 48805 unigenes (44.01%), of which 26,071 (23.51%) could be assigned to gene ontology terms. Regarding the biological process ontology, the most common classifications were cellular processes (13,065), metabolic processes (9499) and biological regulation (6523). The most common categories for the cellular component ontology were cellular (13,125), cell compartments (13,004) and organelles (9276). In terms of the molecular function, most common categories involved binding (13,958), catalytic activity (9871) and molecular transporter activity (1479) (Supplementary Figure S2).  Figure S3).

Testing Potential Genetic Adaptation to High Elevations
From the Scutiger transcriptome-based gene set, we were able to successfully align 19 out of 21 anuran genes, exhibiting convergent and continuous genetic adaptation to high elevations in Ranidae and Agamidae (Table in Figure 1 in Sun et al. 2018 [6]). We did not recover the genes XPS and TINAGL1 from Scutiger. Using aBSREL, we then examined each of the 19 alignments for the occurrence of episodic diversification. We detected evidence of diversifying selection in two Scutiger genes (TGS1: uncorr. p = 0.0339, corr. p = 0.2373; SENP5: uncorr. p = 0.0053, corr. p = 0.0475). In all analyses, the test statistics confirmed the selection shown by Sun et al. in Nanorana, as reported by these authors (Supplementary File S2: note that corrected p-values are provided therein as "Test p-value") [6].

Genes Related to Sex Determination or Sex Differentiation and Their Expression Levels
Of a list of 37 genes from the clawed frog genomes, about 73% (27) had significant blast hits in the Scutiger transcriptomes, all of which provide the first sequence data for these genes in this non-model anuran species. In comparison to the length of Xenopus mRNAs, which were used as BLAST templates, the majority of the Himalayan Scutiger unigenes is shorter (between 4% and 98%) and may thus represent fragments. However, unigenes are also up to 3.5 times longer, which may be indicative of insertions ( Table 3). The majority (∼49%) of the target genes were expressed in all tissues and only ∼5% were expressed solely in a single one. Table 3. Sex-determining and sex differentiation gene inventory and homologous Scutiger cf. sikimmensis transcripts. Gene expression is calculated as FPKM based on RNA-seq data from six tissues; the highest gene expression level per gene over all tissues is indicated in bold. CL = cluster of several unigenes.

Discussion
We here report the first well-annotated transcriptome data of a high-altitude amphibian taxon from the Greater Himalaya, based on RNAseq of multiple tissues. Our work, based on a male Scutiger cf. sikimmensis from central Nepal, provides a transcriptome of a representative species from a branch of the amphibian tree of life (Pelobatoidea of some authors [49]; Mesobatrachia of others [50], currently Megophryidae [27]), from which no complete genomes have been sequenced. Although genomic approaches have been used [14,15], to our knowledge no annotated multi-tissue transcriptome has been published for the genus Scutiger. Our study is mostly descriptive, but it has yielded novel discoveries and represents an important turning point for genomic studies in megophryd anurans.

Indications for Potential Genetic Adaptation to High Elevations in Scutiger
Sun et al. [6] used comparative transcriptomic analyses to prove that amphibian and reptile populations, occurring at different altitudes around the Tibetan Plateau, show parallel evolution. They have provided evidence for convergent and continuous genetic adaptation to high elevations in taxa as distant as Anura (Ranidae) and Sauropsida (Agamidae), in which genes with related functions, especially DNA-repair and energy metabolism pathways, exhibit evidence for rapid change and continuous positive selection with increasing elevations. These data let us assume that a similar genomic high-elevation-selection-syndrome might be detectable in Scutiger, sampled in 3289 m above sea level (Methods). Indeed, in two out of 19 key genes [6], we have detected diversifying selection and thus potentially positive selection in the Scutiger transcriptome as well. In another three of these key genes (PGS: uncorr. p = 0.180, corr. p = 1.000; OLFM4: uncorr. p = 0.0818, corr. p = 0.6542; PPIL2: uncorr. p = 0.185, corr. p = 1.000) we obtained uncorrected p-values close to significance level, possibly indicating weak signs of adaptive molecular evolution. However, the analyses of only a single Scutiger-transcriptome (data were unobtainable from a whole radiation of species or an altitudinal gradient as available for Ranidae [6]) limits our approach and did not allow for further comparative tests. Therefore, future research with multiple species and a greater scale of altitudinal variation might yield additional evidence for genes adapted to the life in high altitudes in lazy toads. This seems especially justified since similar traits have been studied extensively in other vertebrates, such as mammals including humans [51].

Expression of Genes Involved in Sex Determination or Sex Differentiation
As a second application, we used the transcriptomes of Scutiger cf. sikimmensis to search for candidate genes, known for their role in sex determination or sex differentiation in other vertebrates (e.g., [45]). Most, if not all, anurans exhibit genetic sex determination [52][53][54] and generally exhibit sex-biased gene expression, depending on phenotypic sex [55]. As most anurans [56], to our knowledge, Scutiger exhibits homomorphic sex chromosomes and nothing is known about sex determination in Scutiger.
Of the 27 genes with potential sex roles, detected in Scutiger cf. sikimmensis, two tissues, the brain (32%) and the testes (22%), exhibited the highest numbers of highly expressed genes, as typical of sex determination or differentiation genes. About a quarter of the relevant 37 genes have remained uncharacterized, presumably, since their levels of expression was below the threshold to be assembled into unigenes or might be expressed only during earlier developmental stages or exclusively in females.

Conclusions
Overall, our data provide the first well-annotated multi-tissue transcriptomic resource for a Himalayan amphibian of the genus Scutiger. This transcriptome is available for further studies on evolution and adaptation in Himalayan high-altitude vertebrate species.