Transcriptomic dataset for Sardina pilchardus: Assembly, annotation, and expression of nine tissues

The European sardine or pilchard (Sardina pilchardus) is a small planktivorous pelagic fish found from the North Sea in Europe to the coast of Senegal in West Africa, and across the Mediterranean Sea to the Black Sea. Ecologically, sardines are an intermediate link in the trophic network: they prey on plankton and are in turn preyed upon by larger fishes, marine mammals, and seabirds. The species is of great nutritional and economic value as a cheap but rich source of protein and fat. It is either consumed directly by humans or processed into fishmeal for aquaculture and farm animals. Despite its importance in the food basket, little is known about the molecular mechanisms involved in protein and lipid synthesis in this species. We collected nine tissues of Sardina pilchardus and reconstructed the transcriptome. In all, 198,597 transcripts were obtained, of which 68,031 are protein-coding. The quality of the transcriptome was assessed by back-mapping reads to the assembly and by searching for Single-Copy Orthologs. Additionally, Gene Ontology and KEGG annotations were retrieved for most of the protein-coding genes. Finally, each library was quantified in Transcripts per Million to reveal its expression patterns.


© 2021 The Author(s)

Value of the Data
• We present the Illumina sequencing effort and de novo transcriptome assembly of Sardina pilchardus, a small pelagic fish of nutritional, economic, and ecological importance.
• These data will facilitate genome annotation and the discovery of genes of interest for the aquaculture industry. The resource could also serve as the basis of a SNP chip to differentiate sardine stocks across the Atlantic Ocean and the Mediterranean Sea.
• The transcriptome, annotation, and expression patterns can be used to study the genes and pathways involved in ω-3 fatty acid synthesis and storage.
• The tissue quantification can guide RT-qPCR experiments on a transcript of interest by indicating the tissue in which the target gene is known to be active.
• Comparative evolutionary studies can be carried out to unravel the phylogenetic relationships of the sardine within the Clupeiformes or with other teleost species.
• Selection signatures can be identified by investigating functional differences between orthologous genes in sardines and other Clupeiformes species inhabiting different environments.
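As an illustration of the RT-qPCR use case above, the following minimal Python sketch picks the tissue in which a transcript shows the highest TPM. The column layout, transcript identifiers, and values are hypothetical and serve only as an example; the actual layout of the expression table in the supplementary data may differ.

```python
import csv
import io

# Hypothetical excerpt of a per-tissue TPM table (tab-separated).
# Identifiers and values are invented for illustration only.
EXAMPLE_TSV = (
    "transcript_id\tbrain\tliver\tmuscle\n"
    "TRINITY_DN1_c0_g1_i1\t2.5\t40.0\t0.0\n"
    "TRINITY_DN2_c0_g1_i1\t12.0\t1.5\t30.25\n"
)


def most_expressed_tissue(tsv_text, transcript_id):
    """Return (tissue, tpm) for the tissue with the highest TPM of a transcript."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["transcript_id"] == transcript_id:
            # Every column except the identifier is assumed to be a tissue.
            tpms = {t: float(v) for t, v in row.items() if t != "transcript_id"}
            best = max(tpms, key=tpms.get)
            return best, tpms[best]
    raise KeyError(transcript_id)


print(most_expressed_tissue(EXAMPLE_TSV, "TRINITY_DN1_c0_g1_i1"))
```

With a real table, the same function could be applied to the file contents read from disk, provided the identifier column name is adapted.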

Data Description
This dataset contains the RNA-Seq analysis of nine tissues of Sardina pilchardus. The tissues, taken from two female sardines and one male, were dissected onboard and immediately immersed in RNAlater. Sequencing was performed on the Illumina HiSeq 2000 platform and yielded 56 million strand-specific paired-end reads of 101 base pairs, with a median quality value per sequence of 37 and an average of 5.6 million reads per sample, for a total of 5.70 Gbp (Table 1). Reads were preprocessed with Trimmomatic, which retained 98.09% of the reads and slightly reduced the mean read length to 100.67 base pairs. Clean reads were assembled with Trinity. To measure the quality of the assembly, cleaned reads were back-mapped to the reference, and transcripts were searched for Actinopterygii Single-Copy Orthologs (SCOs). Transcripts were annotated with TransDecoder and Trinotate. The results of the sequencing effort and read cleaning are available in Table 1, while those of the assembly, quality control, and annotation are in Table 2. Fig. 1 shows the most frequent Gene Ontology annotations and the coverage of the metabolome based on the KEGG annotations. Finally, each library was quantified with kallisto and prepared for downstream differential analysis with sleuth to obtain the expression pattern of each transcript in every tissue.
The raw reads for the nine tissues of Sardina pilchardus have been deposited at the European Nucleotide Archive under the umbrella project PRJEB18441, and the individual experimental runs under accession numbers ERR5925802 to ERR5925811 (Table 1). To our knowledge, this is one of the most comprehensive datasets not only in Clupeiformes but in fish in general, surpassed only by those in [1].
Supplementary data with the raw transcriptome assembly, predicted protein-coding sequences, transcript annotation, and tissue quantification are available at Figshare under DOI 10.6084/m9.figshare.14617149. They include the assembled transcriptome (sd01-assembly.fasta), the predicted coding sequences (sd02-transdecoder.cds), the annotation (sd03-trinotate.tsv), and the expression profiles per tissue (sd04tpms.tsv).
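For readers unfamiliar with the Transcripts per Million (TPM) unit used in the expression profiles, the sketch below shows the standard TPM definition: each count is divided by the transcript's effective length, and the resulting rates are rescaled to sum to one million. This is a generic illustration with invented numbers, not the kallisto implementation, which estimates counts and effective lengths itself.

```python
def tpm(counts, effective_lengths):
    """Convert estimated counts to Transcripts per Million.

    TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6
    """
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        # No expression at all: every transcript gets a TPM of zero.
        return [0.0 for _ in rates]
    return [r / total * 1e6 for r in rates]


# Invented counts and effective lengths for three transcripts.
values = tpm([100, 200, 300], [1000, 1000, 3000])
print([round(v) for v in values])
```

Because the values of a library always sum to one million, TPMs describe the relative share of each transcript within that library.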

Sampling strategy
Three individuals from the European Atlantic Ocean were collected by the IFREMER institute during the EVHOE scientific surveys (October 10th, 2015 [2]). From these individuals, nine tissues (brain, eye, heart, kidney, liver, muscle, ovaries, skin, and testes) were dissected onboard, immediately immersed in RNAlater (Invitrogen), and stored at −20 °C until further processing.

RNA extraction, library construction, and sequencing
Total RNA from nine tissues (Table 1) of three individuals was extracted using TRIzol® Reagent (Life Technologies) and quantified with an Agilent 2100 Bioanalyzer combined with Agilent RNA 6000 Nano chips (Agilent Technologies, Inc.) at the Gene Expression Unit (SGIker) of the University of the Basque Country UPV/EHU. Samples with RNA integrity numbers (RIN) below 8 were immediately discarded. For every tissue, the sample with the highest RIN was used for sequencing. The exceptions were testes, for which there was only one male specimen, and ovary, for which both samples were used. A multiplex sequencing library was prepared by labeling each sample with specific 10-mer barcoding oligonucleotides. The barcoded RNA-Seq libraries were sequenced on a single lane of the Illumina HiSeq 2000 platform. Sequencing reactions were performed with a paired-end 101 bp, strand-specific protocol at the sequencing facility of the CNAG (Centre Nacional d'Anàlisi Genòmica, Barcelona, Spain). Base-calling was performed using the Illumina native software.
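The sample selection rule described above (discard RIN below 8, then keep the best sample per tissue, except for ovary where all passing samples were kept) can be sketched as follows. The sample labels and RIN values are hypothetical and are not the measured values of this study.

```python
from collections import defaultdict

RIN_THRESHOLD = 8.0  # samples below this value are discarded


def select_samples(samples, keep_all=("ovary",)):
    """Select samples for sequencing.

    samples: iterable of (tissue, individual, rin) tuples.
    keep_all: tissues for which every passing sample is kept.
    Returns a dict mapping tissue to a list of (individual, rin).
    """
    passing = defaultdict(list)
    for tissue, individual, rin in samples:
        if rin >= RIN_THRESHOLD:
            passing[tissue].append((individual, rin))

    selected = {}
    for tissue, candidates in passing.items():
        if tissue in keep_all:
            selected[tissue] = sorted(candidates)
        else:
            # Keep only the sample with the highest RIN.
            selected[tissue] = [max(candidates, key=lambda c: c[1])]
    return selected


# Hypothetical measurements: F1/F2 are females, M1 is the male.
example = [
    ("liver", "F1", 9.1), ("liver", "F2", 8.4), ("liver", "M1", 7.2),
    ("ovary", "F1", 8.8), ("ovary", "F2", 9.0),
    ("testes", "M1", 8.5),
]
print(select_samples(example))
```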

Read processing, assembly and quality control
Raw reads were processed with Trimmomatic v0.33 [3] using a gentle procedure to remove adapters and low-quality bases, with the parameters 'SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25'. The trimmed reads were assembled with Trinity [4] using default parameters, except that the input reads were declared as strand-specific to optimize the assembly. To assess the reliability of this assembly, a two-fold approach was used to study its completeness and representativeness. First, the transcriptome was analyzed by running BUSCO [5] against the Actinopterygii (ray-finned fishes) database. This software compares the transcriptome against a precomputed set of proteins conserved as Single-Copy Orthologs (SCOs) and reports how many of them are found complete, duplicated, fragmented, or missing. Second, the representativeness of the reference was assessed by back-mapping the cleaned reads to the assembly with Bowtie2 [6].
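To make the trimming parameters concrete, the sketch below applies the four steps in the order given (sliding window of 4 bases with a mean quality threshold of 5, removal of leading and trailing bases below quality 5, and a minimum length of 25). It is a simplified illustration of the general idea and does not reproduce the exact edge-case behavior of Trimmomatic; the function names are hypothetical.

```python
def sliding_window_cut(quals, window=4, min_mean=5):
    """Return how many bases to keep: cut at the first window whose
    mean quality falls below min_mean (simplified SLIDINGWINDOW)."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean:
            return start
    return len(quals)


def trim_read(seq, quals, window=4, min_mean=5,
              leading=5, trailing=5, minlen=25):
    """Apply SLIDINGWINDOW, LEADING, TRAILING and MINLEN in that order.

    seq: base string; quals: list of Phred scores of the same length.
    Returns (seq, quals) after trimming, or None if the read is dropped.
    """
    keep = sliding_window_cut(quals, window, min_mean)
    seq, quals = seq[:keep], quals[:keep]

    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1  # LEADING: drop low-quality bases at the 5' end
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1    # TRAILING: drop low-quality bases at the 3' end
    seq, quals = seq[start:end], quals[start:end]

    if len(seq) < minlen:
        return None  # MINLEN: discard reads that became too short
    return seq, quals


# A 34-base read whose last four bases have very low quality.
trimmed = trim_read("A" * 34, [30] * 30 + [2] * 4)
print(len(trimmed[0]))
```

Such low thresholds remove only the worst bases, which is consistent with the small reduction in read count and mean length reported for this dataset.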

Functional annotation and quantification
Functional annotation of the transcriptome was performed by running the protein prediction software TransDecoder v5.0.2 [4], followed by the annotation of both transcripts and proteins with Trinotate v3.0.2 [7].
TransDecoder translated each transcript into the six possible amino acid sequences and filtered out Open Reading Frames shorter than 300 nucleotides. Afterward, each candidate protein was queried against the SwissProt [8] and Pfam-A [9] databases (downloaded on 2018-10-22), and hits with an E-value or domain noise cutoff less than or equal to 1e-5 were retained.
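The six-frame translation and length filter can be illustrated with the following minimal sketch. It is not TransDecoder's algorithm: it assumes the standard genetic code, requires a start methionine, accepts ORFs that run to the end of the transcript, and interprets the 300-nucleotide threshold as 100 codons excluding the stop codon. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a nucleotide string; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def find_orfs(seq, min_aa=100):
    """Return (strand, frame, protein) for ORFs of at least min_aa residues."""
    seq = seq.upper()
    orfs = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(template[frame:])
            # Stop codons ('*') split the frame into candidate segments.
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    orfs.append((strand, frame, segment[start:]))
    return orfs


# Toy transcript: ATG, 99 alanine codons, stop codon (303 nt in total).
toy = "ATG" + "GCT" * 99 + "TAA"
print([(s, f, len(p)) for s, f, p in find_orfs(toy)])
```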
Subsequently, Trinotate was executed with default settings, using the same SwissProt and Pfam databases and the same threshold parameters for BLASTX, BLASTP, and hmmscan. Briefly, transcripts, predicted coding sequences, and proteins are compared against the SwissProt and Pfam databases, and for each positive match the query sequence inherits the annotation of the matching database entry. In this way, sequences obtain Gene Ontology [10] and KEGG [11] annotations. Annotations from at least one database were obtained for 55,781 proteins. Fig. 1 shows the distribution of Gene Ontology terms and the parts of the metabolome covered according to the KEGG annotation, generated with the ggplot2 R package [12] and iPath 3.0 [13], respectively.
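The annotation-transfer principle described above can be sketched in a few lines: hits at or below the E-value cutoff pass their database annotations on to the query sequence. This is only a conceptual illustration and not Trinotate's implementation; the query and subject identifiers, E-values, and term lists are hypothetical.

```python
E_VALUE_CUTOFF = 1e-5


def transfer_annotations(hits, db_annotations, cutoff=E_VALUE_CUTOFF):
    """Let each query inherit the annotations of its significant hits.

    hits: iterable of (query_id, subject_id, evalue).
    db_annotations: dict mapping subject_id to a set of annotation terms.
    Returns a dict mapping query_id to the set of inherited terms.
    """
    annotated = {}
    for query, subject, evalue in hits:
        if evalue <= cutoff and subject in db_annotations:
            annotated.setdefault(query, set()).update(db_annotations[subject])
    return annotated


# Hypothetical database entries and search hits.
db = {
    "SUBJECT_A": {"GO:0005524"},
    "SUBJECT_B": {"GO:0005524", "KEGG:example"},
}
hits = [
    ("transcript_1", "SUBJECT_A", 1e-30),
    ("transcript_1", "SUBJECT_B", 1e-8),
    ("transcript_2", "SUBJECT_B", 0.2),  # above the cutoff, ignored
]
print(transfer_annotations(hits, db))
```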

Ethics Statement
Research complies with the ARRIVE guidelines and was conducted in accordance with the EU directive 2010/63/EU. IFREMER research vessels are under the supervision of the French Ministry of Education and Research. A steering committee evaluates and approves the campaign program.

Funding Information
We gratefully acknowledge funding from the Basque Government through a predoctoral grant (PRE_2017_2_0169) and from the Basque University System research group IT1233-19, "Applied Genomics and Bioinformatics". We also acknowledge funding from the IFREMER institute and from FFP (France Filière Pêche) through the project CAPTAIN.