Genome-wide identification and characterization of long noncoding and circular RNAs in germline stem cells

Germline stem cells are germ cells at an early developmental stage, so their development is key to ensuring human reproduction. There is increasing evidence that long noncoding RNA (lncRNA) and circular RNA (circRNA) play important roles in the development of germ cells. This data descriptor provides unique lncRNA and circRNA transcriptomic information for mouse germline stem cells. Using the Illumina HiSeqx 2000 system, a total of 511,836,732 raw reads were generated. High-quality transcripts, lncRNAs, and circRNAs were identificated and quantified using the reads, and more precise annotations of lncRNAs (especially 9357 novel lncRNAs) and circRNAs were performed in the germline stem cells. We then analyzed the transcript structures, genetic variants, and the interaction between circRNA and microRNA to provide the basis for subsequent functional experiments. This comprehensive dataset will help advance data sharing and deepen our understanding of mouse germline stem cells, providing a theoretical foundation for research on germ cell development and human reproduction, among others. Design Type(s) transcription profiling design • cell type comparison design Measurement Type(s) transcription profiling assay Technology Type(s) RNA sequencing Factor Type(s) technical replicate Sample Characteristic(s) Mus musculus • testis • ovary Design Type(s) transcription profiling design • cell type comparison design Measurement Type(s) transcription profiling assay Technology Type(s) RNA sequencing Factor Type(s) technical replicate Sample Characteristic(s) Mus musculus • testis • ovary Machine-accessible metadata file describing the reported data (ISA-Tab format)


Background & Summary
There is growing recognition that cells, especially in mammals, produce many thousands of noncoding transcripts. long noncoding RNA (lncRNA), a group of noncoding RNAs that are longer than 200 bp, are emerging as potent regulators of gene expression, and have dramatically altered our understanding of cell biology under pathological conditions 1,2 . Owing to the strong time-and tissue-specificity of lncRNA expression, there is great potential for novel lncRNA prediction. Circular RNAs (circRNAs) are a recently discovered group of noncoding RNAs identified by the presence of a special circular structure formed by covalent bonds 3,4 . In eukaryotic organisms, circRNAs are mostly present in the cytoplasm, but a few intron-cyclized circRNAs are localized in the nucleus 5 . They are widespread in mammalian cells, sometimes being more common than linear RNA. CircRNAs are also highly conserved and are not easily degraded by RNase. As millions of transcripts are generated by next-generation sequencing, many lncRNAs and circRNAs have been identified, but few have been functionally characterized. In this context, it is important to be able to determine the connections between lncRNAs/circRNAs and their target mRNAs, and to clarify their potential functions. RNA-seq is an effective technology that utilizes the features of next-generation sequencing to study lncRNA and circRNA 6 . It can be used in combination with reference annotation-based assembly, which facilitates the detection of novel isoforms being applied in fundamental scientific research, clinical diagnosis, pharmacogenomics research, and drug R&D, among others 7 .
It has been reported that approximately 48.5 million couples experience problems with infertility each year globally 8 . In recent years, with rapid economic growth and the acceleration of economic globalization, there have been increases in the proportion of women pursuing careers and the proportion of marriages that end in divorce. As such, childbearing has been put off until a later age, but reproductive capacity also declines with age. Taking all of these factors together, it is understandable that infertility rates continue to grow. Germline stem cells are germ www.nature.com/scientificdata www.nature.com/scientificdata/  RNA extraction, cDNA library establishment, and Illumina sequencing. Total RNA was extracted from SSCs and FGSCs using Trizol reagent (Invitrogen, CA, USA), following the manufacturer's protocol, and the RNA integrity number (RIN) was determined to evaluate RNA integrity using an Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). Ribosomal RNA was removed from total RNA to maximize the retention of all ncRNA. The obtained RNA was randomly cut into short fragments, and then this fragmented RNA was used as a template to synthesize the first strand of cDNA using random hexamers. The second strand was synthesized by adding buffer, dNTPs, RNase H, and DNA polymerase I. After purification using the QiaQuick PCR kit, end repair with EB buffer, single nucleotide A (adenine) addition, and connection with adapters, the  www.nature.com/scientificdata www.nature.com/scientificdata/ second strand was finally degraded using uracil-N-glycosylase (UNG) 23 . Then, agarose gel electrophoresis was used to select fragment size. The suitable fragments were selected as templates for PCR amplification. Finally, a sequencing library was constructed and sequenced using the Illumina HiSeq 2000 platform (Table 1).
Filtering out "dirty" raw reads and alignment of reads to ribosomal RNA. To ensure the quality of data, raw data should be quality-controlled before information analysis, and data noise could be reduced by filtering. We here refer to reads containing an adapter, excessive "N, " or a large number of low-quality bases as "dirty" raw reads, which need to be removed before information analysis. The steps of filtering are as follows: 1) remove reads with adapters; 2) remove reads containing more than 10% "N"; and 3) remove low-quality reads (bases with Q < 10 constitute more than 50% of all reads). After filtering, 233,978,622 and 246,157,442 clean reads remained in the SSC and FGSC libraries, respectively, and were used for downstream bioinformatic analysis.  www.nature.com/scientificdata www.nature.com/scientificdata/ lncRNA identification. We used the transcriptome data comparison software TopHat2 24 to compare the filtered reads to the reference genome (UCSC mm10). Reads were assembled with Cufflinks 25 after mapping to the genome. Considering the incomplete assembly of transcripts due to read coverage gaps, we performed reference annotation-based transcript (BRAT) 26 assembly. The final assembled transcripts were compared with the reference gene, and the fragments that were roughly identical to the known transcripts were removed. After the assembly, we obtained the whole parsimonious set of transcripts; to detect the novel transcripts from the initial assembly, we compared the assembly transcripts to the reference annotation by utilizing Cuffcompare 25 . We utilized Cuffmege to merge several assemblies together; it automatically filtered a number of transfrags that were probably artifacts and produced a single annotation file for use in downstream differential analysis 27 . We evaluated and compared several software programs for lncRNA prediction, and chose Coding-Non-Coding Index (CNCI) 28 , Coding Potential Calculator (CPC) 29 , and iSeeRNA 30 , which performed well compared with the other software in both accuracy and efficiency. The findings that matched across all of the software were used to define novel lncRNA transcripts. For known lncRNA identification, we used the Gencode (GRCm38, M19) (https:// www.gencodegenes.org/mouse/). The information on the annotation of lncRNAs in the Gencode database is currently more comprehensive than that in other databases, and the classification information of lncRNAs is more detailed. At the same time, we integrated 13 other databases to further verify the reliability of the lncR-NAs: noncode (V4.0), RefSeq, UCSC (mm10), Ensemble, Hox_ncRNAs, Antisense_ncRNA_pipeline, fRNAdb, Affymetrix, TUCP, lncRNAdisease, eRNAs, ncRNA_imprint, and Hox_cluster_ncRNAs. The final potential lncR-NAs were obtained by filtering the above basic properties and coding potential. Overall, 25704 expressed lncR-NAs (16347 known and 9357 novel lncRNAs), 14270 expressed protein coding transcripts were identified and mmu-miR-329-3p 9-3 3 9 9-9-9 9-9-9-9 9 9 3 3 9 9 9-9-9 329 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 9 9 9 9 9 9 9 9 9 29 2 29 2 2 2 2 mmu-miR- 7025-3p  3p 3p 3 3p 3 3p 3p 3p 3 3 3p 3p 3 3 3 3 3 3 3  7 7 7  3p p  3p 3  www.nature.com/scientificdata www.nature.com/scientificdata/ subjected to further analysis (Fig. 3a). The lncRNA identification data were deposited in figshare 22 . To display the distribution of lncRNA candidates along the chromosomes more intuitively, chromosome density distributions over the regions that were annotated as lncRNAs were analyzed statistically 22 . Moreover, we presented examples of RNA-seq read density over randomly selected mRNA, known lncRNA and novel lncRNA (Fig. 3b). Based on the fragments per kilobase of transcript per million mapped reads (FPKM) of each transcript, calculated by "Cufflinks" "abundance estimation mode" across SSCs and FGSCs, we compared the differences in expression among known lncRNAs, novel lncRNAs, and protein-coding genes. The average expression levels of lncRNAs were lower than those for protein coding genes while those of novel lncRNAs were similar to those of known lncRNAs (Fig. 3c). According to its location relative to nearby protein-coding genes, a lncRNA could be classified as sense overlap lncRNA, bidirectional lncRNA, antisense lncRNA, or intergenic lncRNA. When comparing the expression levels of the different classes of the lncRNAs, we found that the expression levels of lncRNAs of each type were similar (Fig. 3d).   www.nature.com/scientificdata www.nature.com/scientificdata/ In order to analyze the structural charateristics of lncRNAs in the germline stem cells, we analyzed and compared the transcript length distribution, ORF length, and exon number between lncRNAs and mRNAs. Our analyses showed that transcript length distribution, ORF length, and exon number of lncRNAs were different from those of protein-coding transcripts. (Fig. 4a-c). The average transcript lengths of known and novel lncR-NAs were much shorter than those of protein-coding transcripts. (Fig. 4a). Next, we compared the ORF length between the lncRNAs and the mRNAs. The principle of ORF analysis is based mainly on the six-frame translation of nucleic acids. Our results showed that the average ORF lengths of the lncRNAs and the mRNAs were 86.24 bp and 394.84 bp, respectively (Fig. 3b). This indicated that the ORF length of mRNAs were significantly longer than those of the lncRNAs. The main reason behind this is that lncRNAs do not encode proteins. Finally, the number of exons of lncRNAs and mRNAs was compared and analyzed, and the results showed that known and novel lncRNAs had fewer exons than mRNAs (Fig. 3c). lncRNA annotation and functional prediction. The annotation and functional prediction of lncR-NAs were performed based on the mechanism of lncRNA function 17 : (1) Interaction analysis of complementary lncRNA-mRNA. To reveal the interaction between antisense lncRNA and RNA, we used RNAplex 31 , software for finding short interactions between two long strands of RNA, to predict complementary binding between antisense lncRNA and mRNA. The program includes the Vienna RNA package, and calculates the minimum free energy according to the thermodynamic structure to predict the best base pairing relationship. The results showed the best lncRNA-mRNA base pairing sites and the minimum free energy of antisense lncRNA and its corresponding mRNA. (2) For lncRNAs up/downstream of a gene, we annotated those classified as being located in an "unknown region" in the former analysis, if they were upstream or downstream of a gene. These lncRNAs could potentially overlap with cis-regulatory elements that are probably involved in transcriptional regulation. (3) Pre-miRNA prediction. We aligned lncRNAs to miRBase 32 to detect potential pre-miRNAs, with those showing hit coverage higher than 90% being selected. Support vector machine (SVM)-based software miRPara 33 was also used to predict probable miRNAs. It classified sequences from miRBase into animal, plant, and overall categories and used an SVM to train three models based on the physical properties of pre-miRNA and its miRNAs. (4) lncRNA family prediction. To better annotate lncRNA at the level of evolution, we used INFERNAL to classify all predicted lncRNAs according to their conserved sequence and secondary structure through multiple sequence alignment, secondary structure, and a covariance model 33 . The lncRNA annotation data were deposited in figshare 22 . circRNA identification. In accordance with the structural characteristics and splicing sequence characteristics of circRNAs, we used the following methods to identify them [34][35][36] : (1) Given the circular character of circR-NAs, a database of junction reads was first established by using clean data that could not be compared with the mouse genome. With these junction reads as anchors, alternatively spliced transcripts were assembled by extending Cufflinks to both ends. (2) The assembled transcripts were divided into two sections centered on the junction reads and BLAST localization of the genome was performed; if the positions of the two sections in the mouse genome were reversed, the transcripts were considered as candidate circRNAs. (3) The candidate circRNAs were further filtered for protein-coding potential. The filtering parameters were as follows: Pysocsf (score < 100), CPC (score < 10), CNCI (score < 0), and Pofam (score < 0). Overall, 18822 expressed circRNAs derived from 5334 host genes were identified and subjected to further analysis (Fig. 5a). Among these 18822 circRNAs, 98% were exonic circRNAs (Fig. 5b). Moreover, the vast majority of exonic circRNAs identified were processed from exons in the middle of their hosting genes, with only a few including the first or the last exons (0.1% for the first exon, and none for the last exon) (Fig. 5c). The circRNA identification data were deposited in figshare 22 .
To display the distribution of circRNA candidates along the chromosomes more intuitively, we used Circos software (http://circos.ca/) to map the genomic locations of circRNAs screened as described above. We performed genome mapping according to different samples of circRNAs (Fig. 5d). We also analyzed the expression density of circRNAs in the germline stem cells. The results showed that the expression density of each sample conformed to the normal distribution and the average expression levels of circRNAs were lower than those of the proteincoding genes (Fig. 5e). circRNA annotation and functional prediction. On the basis of the relationship between the locations of circRNAs in the genome and protein-coding genes, the screened circRNA candidates were annotated, which mainly involved categorical and functional annotations. The categorical annotations of circRNAs were mainly based on their positional relationships in the genome, which can be divided into five types: intronic circRNA, exonic circRNA, antisense circRNA, sense overlapping circRNA, and intergenic circRNA. The functional annotation of circRNAs was based on the mechanism by which the circRNAs form. This mainly involves the annotation of circRNA function according to the corresponding circRNA-hosting gene function, including gene function annotation, gene ontology (GO) annotation, and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation 35 . The circRNA annotation data were deposited in figshare 22 . Analysis of circRNA-miRNA interaction network. The analysis of interactions between circRNAs and miRNAs was mainly based on Targetscan 7.0 software (http://www.targetscan.org/) and miRanda software (http://www.microrna.org/microrna/home.do). The former software predicts the target of microRNAs based on the seed region, while the latter is mainly based on the size of the binding free energy between circRNA and miRNA. The smaller the binding free energy, the stronger the binding ability, and the screening threshold is max. energy ≤ −20. An entire circRNA-miRNA interaction network of germline stem cells was constructed 22 , part of which was delineated by Cytoscape (Fig. 6). www.nature.com/scientificdata www.nature.com/scientificdata/ Identification of SNps, indels, AS, and DeU. Based on data at the transcriptome level, the SNP loci in coding regions were analyzed. According to the results of the TopHat comparison between each sample and the reference genome, Samtools software 37 was used to mpileup the possible SNP and indel information of each sample, and Annovar software 38 was used to annotate it 22 . We selected Alternative Splicing Detector (ASD; available at http://www.novelbio.com/asd/ASD.html) as a tool to detect the cases with differential alternative splicing based on a bam file after mapping and using the mouse genome sequence as a reference, based on a P-value threshold of <0.05 (Fig. 7a). DEU analysis is currently used for alternative exon usage in alternative splicing 22 .In this study, DEXSeq software 39 was used for DEU analysis (Fig. 7b). DEXSeq uses a generalized linear model to detect differentially expressed genes at the exon level 22 .

Data Records
The identification, annotation data (circRNAs and lncRNAs) as well as SNPs, indels, AS, DEU and circRNA-miRNA interaction network were uploaded to figshare 22 . The original and normalized data associated with the samples analyzed in this study are deposited at Gene Expression Omnibus (GEO) datasets 40 .

Technical Validation
Quality control-RNA integrity. The RNA integrity number (RIN) of total RNA in germ stem cells was determined to evaluate RNA integrity using an Agilent Bioanalyzer 2100 (Agilent Technologies). The results showed that the RIN value was more than 7.0 for each sample. The values met the requirements for a noncoding RNA sequencing library.
Quality validation and analyses. We applied FastQC v0.11.5 software to determine data quality and analyzed several variables reflecting this 41 . A representative summary plot is depicted (FGSC-3). Here, the per base sequence quality was high, with a median quality score above 30, suggesting high-quality sequences across all bases (Fig. 8a). We also created a figure of base composition and base quality for clean data. The results showed satisfactory base composition because the T curve is in accordance with the A curve; meanwhile, the C curve is in accordance with the G curve (Fig. 8b). The per sequence GC content was examined. The pattern of GC composition was similar to the theoretical distribution, indicating that the samples were free from contamination (Fig. 8c). In addition, the sequence length distribution also corresponded to the theoretical curve (Fig. 8d). All other FastQC files were shown to have similar quality metrics compared to sample FGSC-3.
The data provided in these experimental datasets are the first reported genome-wide lncRNA and circRNA transcriptome resources for male and female mouse germline stem cells, which include SNPs, indels, AS, and DEU analyzed by Illumina high-throughput sequencing technology. These findings are useful for the identification of lncRNAs and circRNAs related to the self-renewal and sex-specific properties required for differentiation into gametes, as well as the development of polymorphic genetic markers in germline stem cells and other related research.