De novo transcriptome assembly and annotation of parthenogenetic lizard Darevskia unisexualis and its parental ancestors Darevskia valentini and Darevskia raddei nairensis

Darevskia rock lizards include 29 sexual and seven parthenogenetic species of hybrid origin distributed in the Caucasus. All seven parthenogenetic species of the genus Darevskia were formed as a result of interspecific hybridization of only four sexual species. It remains unknown what are the main advantages of interspecific hybridization along with switching on parthenogenetic reproduction in evolution of reptiles. Data on whole transcriptome sequencing of parthenogens and their parental ancestors can provide value impact in solving this problem. Here we have sequenced ovary tissue transcriptomes from unisexual parthenogenetic lizard D. unisexualis and its parental bisexual ancestors to facilitate the subsequent annotation and to obtain the collinear characteristics for comparison with other lizard species. Here we report generated RNAseq data from total mRNA of ovary tissues of D. unisexualis, D. valentini and D. raddei with 58932755, 51634041 and 62788216 reads. Obtained RNA reads were assembled by Trinity assembler and 95141, 62123, 61836 contigs were identified with N50 values of 2409, 2801 and 2827 respectively. For further analysis top Gene Ontology terms were annotated for all species and transcript number was calculated. The raw data were deposited in the NCBI SRA database (BioProject PRJNA773939). The assemblies are available in Mendeley Data and can be accessed via doi:10.17632/rtd8cx7zc3.1.

reptiles. Data on whole transcriptome sequencing of parthenogens and their parental ancestors can provide value impact in solving this problem. Here we have sequenced ovary tissue transcriptomes from unisexual parthenogenetic lizard D. unisexualis and its parental bisexual ancestors to facilitate the subsequent annotation and to obtain the collinear characteristics for comparison with other lizard species. Here we report generated RNAseq data from total mRNA of ovary tissues of D. unisexualis, D. valentini and D. raddei with 58932755, 51634041 and 62788216 reads. Obtained RNA reads were assembled by Trinity assembler and 95141, 62123, 61836 contigs were identified with N50 values of 2409, 2801 and 2827 respectively. For further analysis top Gene Ontology terms were annotated for all species and transcript number was calculated. The raw data were deposited in the NCBI SRA database (BioProject PRJNA773939

Value of the Data
• Data provides information about the first assembled ovary transcriptomes of three genetically related Darevskia lizards species and information about their genes and proteins. • This data may benefit evolutionary biologists because it shows genetic differences between unisexual (parthenogenetic) and bisexual parental lizards. • The data may provide insight into the genetic underpinning of parthenogenetic reproduction and can be used in further study of these genes.

Data Description
Ovary RNA from three individuals of each species was pooled together and used to prepare the three cDNA libraries: D. unisexualis , D. raddei nairensis, D. valentini . Table 1 shows the total number of bases, reads, GC (%), Q20 (%), and Q30 (%) that were calculated for the three samples. The characteristics of assembled transcriptome sequences are presented in Table 2 . Structural characteristics of three transcriptomes are shown in Fig. 1 . Obtained Trinity assemblies contain 60132 transcripts for D. unisexualis , 41680 for D. valentini, and 413664 for D. r. nairensis . TransDecoder peptide output was used for BLASTP, Pfam, and EggNOG search ( Fig. 1 A, Supplementary 1 ). BLASTP v. 2.9.0 + revealed 14049, 12331, and 11865 proteins for D. unisexualis, D. valentini, and D. nairensis respectively ( Fig. 1A ). Parthenogenetic species D. unisexualis showed greater TRINITY contigs ( > 81.4% and > 87.2%) and transcripts ( > 44.3% and > 45.4%) numbers than D. valentini and D. r. raddei respectively ( Fig. 1 B). The D. unisexualis showed more hits for each of the searching instruments. Top 10 GO terms taken from all GO terms datasets and distribution graphs are presented in Fig. 2 ( Supplementary 2 ) . The biggest number of annotated genes and the most annotated category was a cellular component, biological processes were less annotated. In the molecular functions category, most genes were related to binding. The most highly enriched genes in biological processes were related to the regulation of transcription of RNA polymerase II. It was found that in cellular components over-represented molecules were the nucleus and cytoplasm origin. In total, 38844, 38756, 63219 transcripts with GO terms were annotated in Table 3 for D. valentini, D.raddei, and D. unisexualis respectively. The summary of Trinotate shows a Q20 -ratio of bases with probability of containing no more than one error in 100 bases. b Q30 -ratio of bases with probability of containing no more than one error in 1,0 0 0 bases.

Table 2
Summary characteristics of transcriptome sequence assembly of all three samples data.  Table 3 ). The final TransDecoder stats are presented in Table 4 . The overall number of ORFs in D. unisexualis was 45.7% more than in bisexual parental samples, according to the TransDecoder results.. The analysis of common and unique genes on Venn diagrams ( Fig. 3 A, B) 027492 N, 43.816634 E) were used to surgically extract ovary. Before dissecting out the organs, the animals were subjected to chloroform euthanasia. All tissue samples were stored in RNAlater ® reagent at −20 °C according to the manufacturer's recommended protocol (Qiagen Inc.) until they were shipped to Macrogen Inc. (Korea) for RNA extraction and further transcriptome preparation.

RNA sequencing and raw data quality control
Total RNA was isolated from an organ/tissue using standard Trizol Tissue RNA Extraction protocol (Standard protocol for QIAzol Lysis Reagent, Qiagen). RNA RIN scores ranged from 6.4 to 6.7. Ovary RNA from three individuals of each species was pooled together and used to prepare the three cDNA libraries: D. unisexualis , D. r. nairensis, D. valentini . Inside the procedure was a cleanup on a carrier with polyT and random primers from TruSeq Stranded mRNA kit were used for preparation cDNA. The paired-end sequencing libraries were prepared by random fragmentation of the cDNA samples into 350-500 bp fragments, followed by 5 and 3 adapter ligation using TruSeq RNA Sample Prep Kit v2 (Illumina Inc.) according to TruSeq RNA Sample Preparation Guide (Version 2, Part #15026495 Rev.F).
Sequencing of transcriptome libraries was performed on Illumina HiSeq2500 with a mean read length of 101 bp. The Illumina Hiseq generated raw sequencing data utilizing HiSeq Control Software v2.2 for system control and base calling through an integrated primary analysis software. The BCL (base calls) binaries were converted into FASTQ by the Illumina package bcl2fastq (v1.8.4) [1] (RRID:SCR_015058). Raw transcriptome data were trimmed by Trimmomatic v0.39 [2] to remove adapters. Optical duplicates from reads were removed by the rmdup tool [3] . Raw transcriptomes contained 58932755, 51634041, and 62788216 reads for D. unisexualis, D. valentini and D. r. nairensis with GC content of 48.05%, 46.59%, and 46.77% respectively. Filtered reads quality was estimated by FastQC v0.11.9 [4] and became prepared for assembling.

Transcriptome annotation and assembly
Reads obtained after trimming and quality estimating by FastQC and Seq2fun pipeline [5] were assembled using Trinity v2.1.1 [6] . Transcriptome assembly with Trinity can be divided into several parts: searching and calculating k-mers, assembling contigs from k-mers, clustering contigs into components. For Trinity assembler, the default parameters were taken, where the minimum contig length value was 200, k-mer size was 25. TransDecoder v5.5.0 [7] program was used to predict translated proteins and ORFs (open reading frames) from assembled transcripts with at least 100 amino acids length. NCBI-blast-2.9.0 + [8] was used for homology search and protein domain identification on TransDecoder predicted proteins with such parameters as e-value < 1e-5 and percentage of similarity > 95%.
The total number of bases, reads, GC (%), Q20 (%), and Q30 (%) were calculated for the three samples ( Table 1 ). Obtained protein sequences from TransDecoder were cross-referenced with the Gene Ontology (GO) [9 , 10] database using the EggNog v2.0.1 [11] tool. This tool provides functional information in the context of structure, molecular functions, the biological process of query sequences, search matches, and performs them as GO terms. Top GO terms were determined and visualized using the Trinotate package in the TransPi [12] pipeline. PFAM [13] and BLASTP searches were also performed by the TransPi pipeline with OnlyAnn (only annotation) option. This mode used such databases as Swissprot, Uniprot custom database (available under request), and Pfam.

AMPs identification
To identify antimicrobial peptides (AMP) in the transcriptome, we blasted the assembled transcripts against the known AMPs from the DRAMP 3.0 database (Data Repository of Antimicrobial Peptides) [14] using BLAST-2.2.26 + [15] with the similarity cutoff of 70%.

Ethics Statement
All individuals were hand-caught; alive-animal handling procedures were approved by Yerevan State University according to the ethical guidelines, capture permit Code 5/22.1/51043 was issued by the Ministry of Nature Protection of the Republic of Armenia for scientific studies. The study was approved by the Ethics Committee of the Moscow State University (Permit Number: 24-01) and conducted strictly according to ethical principles and scientific standards.

Declaration of Competing Interest
All authors have read and approved the final manuscript. Consent for publication: Not applicable. The authors declare that they have no competing interests.