Dataset from RNAseq analysis of bud differentiation in Ficus carica

The data presented here concern transcriptome profiling and differential expression analysis with an RNA-Seq approach, with the following goals: de novo transcriptome assembly and genome annotation of Ficus carica, and differential expression analysis of parthenocarpic and non-parthenocarpic varieties in order to identify candidate genes for the production of seedless figs. Two fig varieties, Dottato and Petrelli, and the caprifig were grown at the fig repository of the ‘P. Martucci’ experimental station in Valenzano (Bari) of the University of Bari ‘Aldo Moro’. The data include: RNA-seq data obtained from fruits of parthenocarpic and non-parthenocarpic varieties; gene expression in the different genetic materials; and up- and down-regulated genes. The data in this article support information presented in the research article “I. Marcotuli, A. Mazzeo, P. Colasuonno, R. Terzano, D. Nigro, C. Porfido, A. Tarantino, R. Aiese Cigliano, W. Sanseverino, A. Gadaleta, G. Ferrara, Fruit Development in Ficus carica L.: Morphological and Genetic Approaches to Fig Buds for an Evolution From Monoecy Toward Dioecy. Front. Plant Sci. (2020) 11:1208. doi: 10.3389/fpls.2020.01208”.


The genotypes on the NCBI database were reported as fiorone (harvested in April) and fico (harvested in July) [3].

Value of the Data
• These data add to the knowledge of the bud differentiation process and can help in understanding what makes a bud develop into a main crop in the current year, or enter dormancy and develop into a breba in the following season.
• These data include additional information on genes expressed and up- or down-regulated during bud development and differentiation.
• These data can help fill the current lack of information on the bud differentiation mechanisms behind the different crops.

Objective
Fruit development in fig is a very complex process, since there is large variability among fig varieties, including varieties that need pollination and varieties that do not. Additionally, the "main crop" of certain genotypes can be separated into two sub-groups: the main crop, maturing in the period July-September, and the late "main crop", maturing in autumn and borne on the trees up to December. There are also genotypes producing only a main crop that ripens late in the summer season. This "difference" in crops allows varieties to be classified as uniferous (only main crop), biferous (two crops, breba and main crop), and triferous (breba, summer, and late main crop) [ 5 , 6 ].
Fig genetic variability can be an interesting resource for breeding and for understanding the parthenocarpic production of figs.
The present paper describes the integrated pipeline used to produce a de novo transcriptome assembly and annotation of Ficus carica.

Data Description
The goal of the analysis was to improve and complete the already available F. carica annotation data by integrating different sources of information.
The NCBI repository contains six folders, each one containing the raw sequence reads of Dottato, Petrelli or caprifig at one of the two timepoints. The entries are named using the abbreviation of the type of bud, the name of the genotype and the month of sample harvesting, as follows: FDA (Fiorone Dottato April), FDLb (Fico Dottato July), FPA (Fiorone Petrelli April), FPL (Fico Petrelli July), PRA (profig caprifig April) and MLb (mammone caprifig July).
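Because the entry codes are not fully regular (for example FDLb and MLb), a small lookup table may be a convenient way to work with them. The sketch below simply transcribes the six codes listed above; the names `SAMPLES` and `samples_by_month` are illustrative and not part of the dataset.

```python
# Lookup table transcribed from the entry names listed above.
# Each code maps to (bud type, genotype, harvest month).
SAMPLES = {
    "FDA": ("fiorone", "Dottato", "April"),
    "FDLb": ("fico", "Dottato", "July"),
    "FPA": ("fiorone", "Petrelli", "April"),
    "FPL": ("fico", "Petrelli", "July"),
    "PRA": ("profig", "caprifig", "April"),
    "MLb": ("mammone", "caprifig", "July"),
}


def samples_by_month(month):
    """Return the sorted entry codes harvested in the given month."""
    return sorted(code for code, (_, _, m) in SAMPLES.items() if m == month)


if __name__ == "__main__":
    print(samples_by_month("April"))
    print(samples_by_month("July"))
```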
The Mendeley Data repository contains two files: one with the newly obtained genome annotation (GTF file) and a second one with the Gene Ontology annotation of the sequences in a standard file format.

Genome Annotation and RNA-seq Analysis
The RNA sequencing experiment was performed on 6 samples (the three genotypes at two timepoints each). Prior to further analysis, a quality check was performed on the raw sequencing data, removing low-quality portions while preserving the longest high-quality part of the NGS reads. The minimum length was set to 35 bp and the quality score to 25, using the software BBDuk ( Table 1 ). The quality of the reads was checked before and after the trimming step ( Fig. 1 ).
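The idea of keeping the longest high-quality part of a read can be illustrated with a minimal Python sketch. This is not BBDuk's actual algorithm; it assumes a simplified rule in which a base counts as high quality when its Phred score is at least 25, and a read is discarded when its longest high-quality stretch is shorter than 35 bp. The function name `trim_read` is hypothetical.

```python
def trim_read(seq, quals, min_quality=25, min_length=35):
    """Return the longest contiguous stretch of bases whose Phred score is
    at least min_quality, or None if that stretch is shorter than min_length.

    Simplified illustration only; real trimmers such as BBDuk use their own
    trimming algorithms.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, q in enumerate(quals):
        if q >= min_quality:
            if run_len == 0:
                run_start = i  # a new high-quality run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # a low-quality base ends the current run
    if best_len < min_length:
        return None
    return seq[best_start:best_start + best_len]


if __name__ == "__main__":
    read = "ACGT" * 10                 # 40 bp toy read
    quals = [30] * 36 + [10] * 4       # low-quality 3' tail
    print(trim_read(read, quals))      # the first 36 bases are kept
```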

Mapping and Assembly Quality
RNA-seq reads were mapped against the reference genome sequence with STAR (version 2.5.0c) in local mode ( Table 2 ). Then, the reference-guided transcriptome assembly was performed with Trinity (v2.4.0). The number of transcripts obtained was 86,614, and the quality of the assembly was evaluated with different methods: Transrate (v1.0.3), BUSCO (v3), cd-hit-est and STAR (version 2.5.0c). Based on the results obtained, the analysis was carried out using only the longest isoforms (see below for more details about the quality results). The quality of the assembly was then evaluated again, with better results. In addition, a further check was performed with Kallisto to remove transcripts with no expression. After filtering, 50,866 transcripts were retained.
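The two filtering steps described above (keeping the longest isoform per gene, then dropping transcripts with no expression) can be sketched as follows. The sketch assumes Trinity-style identifiers in which the isoform is encoded by a trailing `_i<number>` suffix, and it treats a transcript as unexpressed when its abundance is zero; both the threshold and the function names are assumptions made for illustration, not the authors' code.

```python
import re


def longest_isoforms(transcript_lengths):
    """Keep one transcript per gene: the longest isoform.

    Assumes Trinity-style IDs such as 'TRINITY_DN1_c0_g1_i2', where the
    gene ID is the transcript ID without the trailing '_i<number>'.
    """
    best = {}
    for tid, length in transcript_lengths.items():
        gene = re.sub(r"_i\d+$", "", tid)
        if gene not in best or length > transcript_lengths[best[gene]]:
            best[gene] = tid
    return set(best.values())


def drop_unexpressed(transcripts, abundance, min_abundance=0.0):
    """Keep transcripts whose estimated abundance exceeds min_abundance.

    The zero threshold is an assumption; transcripts missing from the
    abundance table are treated as unexpressed.
    """
    return {t for t in transcripts if abundance.get(t, 0.0) > min_abundance}


if __name__ == "__main__":
    lengths = {
        "TRINITY_DN1_c0_g1_i1": 500,
        "TRINITY_DN1_c0_g1_i2": 900,
        "TRINITY_DN2_c0_g1_i1": 300,
    }
    tpm = {"TRINITY_DN1_c0_g1_i2": 3.2, "TRINITY_DN2_c0_g1_i1": 0.0}
    kept = drop_unexpressed(longest_isoforms(lengths), tpm)
    print(sorted(kept))
```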

Genome Annotation
Our assembled transcriptome was then merged with a set of transcripts produced by Liceth Solorzano Zambrano et al. (2017) and used as input for the Maker pipeline.
At the same time, an ab initio annotation was performed with Augustus, which was also fed to Maker. Four iterations with Maker were performed to improve the Augustus model, and finally new gene annotations were obtained. The BUSCO pipeline was then used to check the quality of the raw annotation.
A new annotation file (GTF) was obtained with the pipeline and compared with the "NCBI" annotation by looking at the coordinates of the genes. The following rules were applied ( Fig. 2 ):
• genes appearing only in the NCBI annotation were always kept;
• genes appearing only in the pipeline annotation were analyzed by BLAST against a dataset of plant proteins, and only those having a significant match were kept (see below for more details about the BLAST step);
• genes having a one-to-one match between NCBI and the pipeline kept the NCBI structure;
• genes overlapping in a one-to-many way (i.e. one NCBI gene matching several Maker genes, or vice versa) were analyzed in more depth to understand which annotation was correct. For this purpose, two BLASTP searches were performed, blasting both the NCBI and the Maker genes against the TrEMBL Plantae and UniProt Plantae databases.
The BLASTP results were processed with an in-house script applying the following rules:
• if a gene had no BLAST hit, it was removed;
• the coordinates of the BLAST hits were processed to detect fusion or fragmentation events and keep the correct loci;
• if genes from both annotations had a hit, the one with the highest coverage was kept.
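The coordinate-based comparison described above can be sketched as a small classification routine. This is a minimal illustration, not the in-house script: it assumes genes are given as (sequence, start, end) tuples, ignores strand, treats any overlap of at least one base as a match, and assumes the two annotations use distinct gene IDs. The function and category names are hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (sequence, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def classify(ncbi, maker):
    """Assign each gene ID to one of four categories based on overlaps.

    ncbi, maker: dicts mapping gene ID -> (sequence, start, end).
    Categories: 'ncbi_only', 'pipeline_only', 'one_to_one', 'one_to_many'.
    """
    n2m, m2n = defaultdict(set), defaultdict(set)
    for n_id, n_loc in ncbi.items():
        for m_id, m_loc in maker.items():
            if overlaps(n_loc, m_loc):
                n2m[n_id].add(m_id)
                m2n[m_id].add(n_id)

    def label(gene_id, own, other, alone):
        partners = own.get(gene_id, set())
        if not partners:
            return alone
        if len(partners) == 1 and len(other[next(iter(partners))]) == 1:
            return "one_to_one"   # the NCBI structure would be kept
        return "one_to_many"      # to be resolved with BLASTP evidence

    result = {g: label(g, n2m, m2n, "ncbi_only") for g in ncbi}
    result.update({g: label(g, m2n, n2m, "pipeline_only") for g in maker})
    return result


if __name__ == "__main__":
    ncbi = {"N1": ("chr1", 100, 500), "N2": ("chr1", 1000, 3000),
            "N3": ("chr2", 1, 100)}
    maker = {"M1": ("chr1", 150, 480), "M2": ("chr1", 1000, 1800),
             "M3": ("chr1", 2000, 3000), "M4": ("chr3", 5, 50)}
    for gene, category in sorted(classify(ncbi, maker).items()):
        print(gene, category)
```

The quadratic pairwise comparison is adequate for a toy example; on a full annotation an interval index sorted by coordinate would be a more practical choice.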
The starting NCBI GTF included 36,138 genes, while the newly created GTF file contained 35,567 genes (34,629 in common and 938 new genes).
Therefore, 1,509 genes were removed because they were judged to be erroneously annotated by the new pipeline, as supported by the UniProt and TrEMBL databases.
In addition, 938 new genes were added to the annotation. AHRD ( https://github.com/groupschoof/AHRD ) was used to assign a description and a Gene Ontology annotation to the sequences.
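The gene counts reported above can be cross-checked with a few lines of arithmetic. The sketch assumes that the 34,629 genes "in common" are the NCBI genes retained in the new annotation; the variable names are illustrative.

```python
# Gene counts as reported in the text.
ncbi_genes = 36_138      # genes in the starting NCBI GTF
final_genes = 35_567     # genes in the new GTF
shared_genes = 34_629    # genes in common between the two annotations
new_genes = 938          # genes found only by the new pipeline

# NCBI genes that did not survive the comparison.
removed_genes = ncbi_genes - shared_genes

# The new annotation is the shared set plus the newly added genes.
assert shared_genes + new_genes == final_genes
assert removed_genes == 1_509

if __name__ == "__main__":
    print("removed:", removed_genes, "added:", new_genes)
```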
Finally, the new annotation was evaluated with BUSCO. To show the significance of the analysis, a new BUSCO protein analysis was performed using the Plantae database as reference.

Counting
featureCounts (version 1.4.6-p5) and the new genome annotation were used to quantify gene expression as raw read counts, from which normalized TMM and FPKM values were calculated.
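As an illustration of how FPKM values relate to raw counts, the sketch below applies the usual definition (fragments per kilobase of gene length per million counted fragments). It assumes the library size is the sum of the counts passed in; pipelines differ on this point (for example, using all mapped reads or a TMM-adjusted library size), so the numbers are not expected to reproduce the deposited values. TMM normalization itself is not shown, and the function name `fpkm` is hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Compute FPKM per gene from raw counts and gene lengths in bp.

    FPKM = count * 1e9 / (gene_length_bp * library_size), where the
    library size is taken here as the sum of all counts (an assumption).
    """
    library_size = sum(counts.values())
    if library_size == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: count * 1e9 / (lengths_bp[gene] * library_size)
        for gene, count in counts.items()
    }


if __name__ == "__main__":
    counts = {"geneA": 500, "geneB": 1500}
    lengths = {"geneA": 1000, "geneB": 3000}
    # geneB has three times the reads but is three times longer,
    # so both genes end up with the same FPKM.
    print(fpkm(counts, lengths))
```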

Statistical Analysis
The R packages HTSFilter and edgeR were used for all statistical analyses. In order to eliminate genes that were not expressed or that showed too high variability, the HTSFilter package was applied; it implements a filtering procedure for replicated transcriptome sequencing data based on a Jaccard similarity index. The "Trimmed Means of M-values" (TMM) normalization strategy was also used ( Fig. 3 ). The filter was applied to the different experimental conditions in order to identify and remove genes that appear to generate an uninformative signal.
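The principle behind a Jaccard-based filter can be sketched in Python. This is a simplified illustration of the idea, not the HTSFilter implementation: for each candidate threshold, expression values are binarized, the Jaccard index is averaged over pairs of replicates from the same condition, and the threshold giving the highest average is retained; genes that never exceed it are then dropped. The candidate grid, the toy data and all function names are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two boolean vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def global_jaccard(expr, conditions, threshold):
    """Mean pairwise Jaccard index between replicates of the same condition.

    expr: sample -> list of normalized expression values (one per gene).
    conditions: sample -> condition label.
    """
    binary = {s: [v > threshold for v in expr[s]] for s in conditions}
    scores = [
        jaccard(binary[s], binary[t])
        for s, t in combinations(list(conditions), 2)
        if conditions[s] == conditions[t]
    ]
    return sum(scores) / len(scores)


def best_threshold(expr, conditions, candidates):
    """Candidate threshold that maximizes the global Jaccard index."""
    return max(candidates, key=lambda s: global_jaccard(expr, conditions, s))


def kept_genes(expr, threshold):
    """Indices of genes exceeding the threshold in at least one sample."""
    n_genes = len(next(iter(expr.values())))
    return [i for i in range(n_genes)
            if any(values[i] > threshold for values in expr.values())]


if __name__ == "__main__":
    expr = {"A1": [0, 1, 50, 80], "A2": [2, 0, 60, 70],
            "B1": [1, 0, 40, 90], "B2": [0, 2, 55, 85]}
    conditions = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
    s = best_threshold(expr, conditions, [0, 5])
    print(s, kept_genes(expr, s))
```

In the toy data, the two weakly expressed genes switch on and off between replicates, so a threshold of 5 makes replicates agree perfectly and only the two well-expressed genes are kept.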
The overall quality of the experiment was evaluated, on the basis of the similarity between samples, by a Principal Component Analysis (PCA) using the normalized gene expression values as input ( Fig. 4 ). Differential expression analysis was carried out by comparing the breba group against the main crop group, used as reference, which allowed the detection of 3708 differentially expressed genes (1697 up-regulated and 2011 down-regulated) ( Fig. 5 ). MA and Volcano plots were also produced ( Fig. 6 ). The MA plot displays the relationship between the average expression value (X-axis) and the fold change (Y-axis) for each gene analysed. The distribution of the dots in the MA plot was used to check whether the differentially expressed genes were evenly distributed across the different ranges of expression values, and how they related to the fold change. The Volcano plot shows the relationship between the fold change (X-axis) and the significance of the differential expression test (Y-axis) for each gene in the genome ( Fig. 6 right). The distribution of the dots in the Volcano plot was used to detect the range of fold changes associated with a stronger or a weaker significance of differential expression.
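The coordinates used in the MA and Volcano plots can be derived from per-gene summary values, as in the sketch below. It assumes mean normalized expression values for the breba and main crop groups, a pseudocount of 1 to avoid taking the logarithm of zero, and a significance cutoff of 0.05 on the adjusted p-value; none of these settings is stated in the text, and the function names are hypothetical.

```python
import math


def plot_coordinates(mean_breba, mean_main, p_value, pseudocount=1.0):
    """Return MA-plot and Volcano-plot coordinates for one gene.

    M is the log2 fold change (breba vs. main crop as reference),
    A is the average log2 expression, and the Volcano Y value is
    -log10 of the p-value. The pseudocount is an assumption.
    """
    log_breba = math.log2(mean_breba + pseudocount)
    log_main = math.log2(mean_main + pseudocount)
    return {
        "A": (log_breba + log_main) / 2,
        "M": log_breba - log_main,
        "neg_log10_p": -math.log10(p_value),
    }


def direction(log2_fc, adj_p, cutoff=0.05):
    """Label a gene as 'up', 'down' or 'not_significant'.

    The 0.05 cutoff is an assumed, commonly used value.
    """
    if adj_p >= cutoff or log2_fc == 0:
        return "not_significant"
    return "up" if log2_fc > 0 else "down"


if __name__ == "__main__":
    coords = plot_coordinates(mean_breba=7.0, mean_main=1.0, p_value=0.001)
    print(coords, direction(coords["M"], adj_p=0.01))
```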

Gene Ontology Enrichment Analysis
Gene Ontology Enrichment Analysis (GOEA) was performed to identify the most enriched Gene Ontology (GO) categories among the down- and up-regulated genes, considering only the significantly differentially expressed genes.
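The text does not state which statistical test was used for the enrichment analysis. A common choice is the one-sided hypergeometric test, sketched below under that assumption: given N annotated genes of which K carry a GO term, it computes the probability of observing at least k genes with that term in a list of n differentially expressed genes. The function name `enrichment_p` is hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total annotated genes (background)
    K: background genes annotated with the GO term
    n: genes in the list of interest (e.g. up-regulated genes)
    k: genes in the list annotated with the GO term
    """
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Toy example: 20 background genes, 5 with the term,
    # 4 genes in the list, 2 of them with the term.
    print(enrichment_p(k=2, n=4, K=5, N=20))
```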

Ethics Statements
The work does not involve human subjects, animal experiments, or any data collected from social media platforms.